Book of Abstracts - phase 14
Transcription
The Association for Literary and Linguistic Computing
The Association for Computers and the Humanities
Society for Digital Humanities — Société pour l’étude des médias interactifs

Digital Humanities 2008
The 20th Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities
and
The 1st Joint International Conference of the Association for Literary and Linguistic Computing, the Association for Computers and the Humanities, and the Society for Digital Humanities — Société pour l’étude des médias interactifs

University of Oulu, Finland
24 – 29 June, 2008

Conference Abstracts

International Programme Committee
• Espen Ore, National Library of Norway, Chair
• Jean Anderson, University of Glasgow, UK
• John Nerbonne, University of Groningen, The Netherlands
• Stephen Ramsay, University of Nebraska, USA
• Thomas Rommel, International Univ. Bremen, Germany
• Susan Schreibman, University of Maryland, USA
• Paul Spence, King’s College London, UK
• Melissa Terras, University College London, UK
• Claire Warwick, University College London, UK, Vice Chair

Local organizers
• Lisa Lena Opas-Hänninen, English Philology
• Riikka Mikkola, English Philology
• Mikko Jokelainen, English Philology
• Ilkka Juuso, Electrical and Information Engineering
• Toni Saranpää, English Philology
• Tapio Seppänen, Electrical and Information Engineering
• Raili Saarela, Congress Services

Edited by
• Lisa Lena Opas-Hänninen
• Mikko Jokelainen
• Ilkka Juuso
• Tapio Seppänen

ISBN: 978-951-42-8838-8
Published by English Philology, University of Oulu
Cover design: Ilkka Juuso, University of Oulu
© 2008 University of Oulu and the authors.

Introduction

On behalf of the local organizers I am delighted to welcome you to the 25th Joint International Conference of the Association for Literary and Linguistic Computing (ALLC) and the Association for Computers and the Humanities (ACH) at the University of Oulu. This is a very special year because this is also the 1st Joint International Conference of the ALLC, ACH and the Society for Digital Humanities – Société pour l’étude des médias interactifs, as this year the SDH-SEMI joined the Alliance of Digital Humanities Organizations (ADHO), and I wish to extend a special welcome to them.

With over 16 000 students and 3000 staff members, the University of Oulu is one of the largest universities in Finland. Linnanmaa, the campus area where the conference is being held, not only houses the University, but also Technopolis, the largest technology centre in Finland, and VTT, the State Technical Research Centre. We hope that this strong influence of technology and the opportunities it affords for multidisciplinary research will be of particular interest to the conference delegates.

Many people have contributed to putting together the conference programme. I wish to thank the International Programme Committee and particularly its Chair, Espen Ore, for all the hard work that they have put into making this conference a varied and interesting one. I also wish to thank John Unsworth and Ray Siemens, who passed on their experiences of the 2007 conference and helped us set up the conference management site. I wish to thank Riikka Mikkola, Mikko Jokelainen, Ilkka Juuso and Toni Saranpää, who have all contributed to making this conference happen. I also thank Tapio Seppänen of the Engineering Faculty for all his help and support.
I hope you will find the conference stimulating and useful and that you enjoy meeting fellow humanities computing scholars from all over the world. If this is your first time in northern Finland, I hope you like it and that it entices you to come and visit us again. Tervetuloa! — Welcome!

Lisa Lena Opas-Hänninen
Local Organizer

A Richness of Research and Knowledge

One of the bad things one may experience in the Programme Committee for a conference under the ADHO umbrella is that there are more good submissions than there is room for. As we read through the submissions for this year’s conference, it became clear that Humanities Computing or Digital Humanities as a field is growing up. The field is also growing, period. Ten years ago I was also Chair of the Programme Committee for the ALLC/ACH conference in Debrecen, Hungary. I am happy that there are now so many new names among the submitters. And there were many very good submissions for papers, panel and paper sessions and posters. The hosting institution, the University of Oulu, could luckily supply the rooms needed if we were to run up to four parallel sessions.

With four parallel sessions it may seem that there is a larger possibility for a conflict of interests: one might simultaneously wish to attend two presentations - or more. But the only way to remove this problem would be to have a single strand conference with a duration of a couple of weeks. So since this, as usual, is a multiple strand conference, I hope the Book of Abstracts may be used as a guide for which presentations to listen to and, alternatively, which authors one might seek out for an informal talk during one of the breaks.

I would like to thank all the members of the PC, and especially the Vice Chair, Claire Warwick, for all their work, which has made the work on the Programme much easier than it could have been. I would also like to thank both Sara Schmidt at the University of Illinois and the local administration group at the University of Oulu, who have done a tremendous amount of work in preparing for the conference programme. And last but definitely not least, my thanks go to the Local Organizer, Lisa Lena Opas-Hänninen, who has been a great aid in the programme work and who has done everything I can imagine to help us get the rooms, people and more sorted out.

But there are also those without whom there would be no conference. First, all of you who have prepared submissions, and then all of you who are attending this conference: thank you very much, and welcome to Oulu!

Espen S.
Ore Chair, International Programme Committee _____________________________________________________________________________ V Digital Humanities 2008 _____________________________________________________________________________ _____________________________________________________________________________ VI Digital Humanities 2008 _____________________________________________________________________________ Table of Contents Introduction .................................................................................................................................III Lisa Lena Opas-Hänninen A Richness of Research and Knowledge.....................................................................................V Espen S. Ore Panels Service Oriented Computing in the Humanities 3 (SOCH3) ................................................ 1 John Bradley, Elpiniki Fragkouli, Allen Renear, Monica Schraefel,Tapio Seppänen The Homer Multitext Project..................................................................................................... 5 Casey Dué, Mary Ebbott, A. Ross Scaife,W. Brent Seales, Christopher Blackwell, Neel Smith, Dorothy Carr Porter, Ryan Baumann Defining an International Humanities Portal .......................................................................... 12 Neil Fraistat, Domenico Fiormonte, Ian Johnson, Jan Christoph Meister, John Unsworth Beyond Search: Literary Studies and the Digital Library ...................................................... 13 Matthew Jockers, Glen Worthey, Joe Shapiro, Sarah Allison Designing, Coding, and Playing with Fire: Innovations in Interdisciplinary Research Project Management ........................................... 16 Stephen Ramsay, Stéfan Sinclair, John Unsworth, Milena Radzikowska, Stan Ruecker Aspects of Sustainability in Digital Humanities ...................................................................... 21 Georg Rehm, Andreas Witt, Øyvind Eide, Christian-Emil Ore, Jon Holmen, Allen H. Renear, Richard Urban, Karen Wickett, Carole L. Palmer, David Dubin, Erhard Hinrichs, Marga Reis ALLC SESSION: e-Science: New collaborations between information technology and the humanities ........................ 29 David Robey, Stuart Dunn, Laszlo Hunyadi, Dino Buzzetti Understanding TEI(s): A Presentation and Discussion Session ............................................. 30 Susan Schreibman, Ray Siemens, Peter Boot, Arianna Ciula, James Cummings, Kurt Gärtner, Martin Holmes, John Walsh DH2008: ADHO Session ‘Digital resources in humanities research: Evidence of value (2)’ ................................................................................................................. 31 Harold Short, David Hoover, Lorna Hughes, David Robey, John Unsworth The Building Blocks of the New Electronic Book ................................................................... 32 Ray Siemens, Claire Warwick, Kirsten C. Uszkalo, Stan Ruecker SDH/SEMI panel:Text Analysis Developers’ Alliance (TADA) and T-REX ............................ 34 Stéfan Sinclair, Ray Siemens, Matt Jockers, Susan Schreibman, Patrick Juola, David Hoover, Jean-Guy Meunier, Dominic Forest Agora.Techno.Phobia.Philia2: Feminist Critical Inquiry, Knowledge Building, Digital Humanities ...................................... 
35 Martha Nell Smith, Susan Brown, Laura Mandell, Katie King, Marilee Lindemann, Rebecca Krefting, Amelia S.Wong _____________________________________________________________________________ VII Digital Humanities 2008 _____________________________________________________________________________ Papers Unicode 5.0 and 5.1 and Digital Humanities Projects ............................................................ 41 Deborah Winthrop Anderson Variation of Style: Diachronic Aspect ....................................................................................... 42 Vadim Andreev Exploring Historical Image Collections with Collaborative Faceted Classification ............ 44 Georges Arnaout, Kurt Maly, Milena Mektesheva, Harris Wu, Mohammad Zubair Annotated Facsimile Editions: Defining Macro-level Structure for Image-Based Electronic Editions .................................. 47 Neal Audenaert, Richard Furuta CritSpace: Using Spatial Hypertext to Model Visually Complex Documents ..................... 50 Neal Audenaert, George Lucchese, Grant Sherrick, Richard Furuta Glimpses though the clouds: collocates in a new light ........................................................... 53 David Beavan The function and accuracy of old Dutch urban designs and maps. A computer assisted analysis of the extension of Leiden (1611) .......................................... 55 Jakeline Benavides, Charles van den Heuvel AAC-FACKEL and BRENNER ONLINE. New Digital Editions of Two Literary Journals ... 57 Hanno Biber, Evelyn Breiteneder, Karlheinz Mörth e-Science in the Arts and Humanities – A methodological perspective ............................... 60 Tobias Blanke, Stuart Dunn, Lorna Hughes, Mark Hedges An OWL-based index of emblem metaphors ......................................................................... 62 Peter Boot Collaborative tool-building with Pliny: a progress report ...................................................... 65 John Bradley How to find Mrs. Billington? Approximate string matching applied to misspelled names . 68 Gerhard Brey, Manolis Christodoulakis Degrees of Connection: the close interlinkages of Orlando .................................................. 70 Susan Brown, Patricia Clements, Isobel Grundy, Stan Ruecker, Jeffery Antoniuk, Sharon Balazs The impact of digital interfaces on virtual gender images .................................................... 72 Sandra Buchmüller, Gesche Joost, Rosan Chow Towards a model for dynamic text editions ............................................................................ 78 Dino Buzzetti, Malte Rehbein, Arianna Ciula, Tamara Lopez Performance as digital text: capturing signals and secret messages in a media-rich experience ...................................... 84 Jama S. Coartney, Susan L.Wiesner Function word analysis and questions of interpretation in early modern tragedy.............. 86 Louisa Connors _____________________________________________________________________________ VIII Digital Humanities 2008 _____________________________________________________________________________ Deconstructing Machine Learning: A Challenge for Digital Humanities.............................. 89 Charles Cooney, Russell Horton, Mark Olsen, Glenn Roe, Robert Voyer Feature Creep: Evaluating Feature Sets for Text Mining Literary Corpora. ........................ 
91 Charles Cooney, Russell Horton, Mark Olsen, Glenn Roe, Robert Voyer Hidden Roads and Twisted Paths: Intertextual Discovery using Clusters, Classifications, and Similarities ............................... 93 Charles Cooney, Russell Horton, Mark Olsen, Glenn Roe, Robert Voyer A novel way for the comparative analysis of adaptations based on vocabulary rich text segments: the assessment of Dan Brown’s The Da Vinci Code and its translations .................. 95 Maria Csernoch Converting St Paul: A new TEI P5 edition of The Conversion of St Paul using stand-off linking ........................ 97 James C. Cummings ENRICHing Manuscript Descriptions with TEI P5 .................................................................. 99 James C. Cummings Editio ex machina - Digital Scholarly Editions out of the Box............................................. 101 Alexander Czmiel Talia: a Research and Publishing Environment for Philosophy Scholars ............................. 103 Stefano David, Michele Nucci, Francesco Piazza Bootstrapping Classical Greek Morphology .......................................................................... 105 Helma Dik, Richard Whaling Mining Classical Greek Gender .............................................................................................. 107 Helma Dik, Richard Whaling Information visualization and text mining: application to a corpus on posthumanism .... 109 Ollivier Dyens, Dominic Forest, Patric Mondou,Valérie Cools, David Johnston How Rhythmical is Hexameter: A Statistical Approach to Ancient Epic Poetry ............... 112 Maciej Eder TEI and cultural heritage ontologies ...................................................................................... 115 Øyvind Eide, Christian-Emil Ore, Joel Goldfield, David L. Hoover, Nathalie Groß, Christian Liedtke The New Middle High German Dictionary and its Predecessors as an Interlinked Compound of Lexicographical Resources ............................................................................. 122 Kurt Gärtner Domestic Strife in Early Modern Europe: Images and Texts in a virtual anthology .......... 124 Martin Holmes, Claire Carlin Rescuing old data: Case studies, tools and techniques ......................................................... 127 Martin Holmes, Greg Newton Digital Editions for Corpus Linguistics: A new approach to creating editions of historical manuscripts .......................................... 132 Alpo Honkapohja, Samuli Kaislaniemi,Ville Marttila, David L. Hoover _____________________________________________________________________________ IX Digital Humanities 2008 _____________________________________________________________________________ Term Discovery in an Early Modern Latin Scientific Corpus .............................................. 136 Malcolm D. Hyman Markup in Textgrid ................................................................................................................... 138 Fotis Jannidis,Thorsten Vitt Breaking down barriers: the integration of research data, notes and referencing in a Web 2.0 academic framework ......................................................................................... 139 Ian R. Johnson Constructing Social Networks in Modern Ireland (C.1750-c.1940) Using ACQ................ 141 Jennifer Kelly, John G. Keating Unnatural Language Processing: Neural Networks and the Linguistics of Speech ........... 
143 William Kretzschmar Digital Humanities ‘Readership’ and the Public Knowledge Project .................................. 145 Caroline Leitch, Ray Siemens, Analisa Blake, Karin Armstrong, John Willinsky Using syntactic features to predict author personality from text ...................................... 146 Kim Luyckx,Walter Daelemans An Interdisciplinary Perspective on Building Learning Communities Within the Digital Humanities ............................................................................................................ 149 Simon Mahony The Middle English Grammar Corpus a tool for studying the writing and speech systems of medieval English ........................... 152 Martti Mäkinen Designing Usable Learning Games for the Humanities: Five Research Dimensions ........ 154 Rudy McDaniel, Stephen Fiore, Natalie Underberg, Mary Tripp, Karla Kitalong, J. Michael Moshell Picasso’s Poetry:The Case of a Bilingual Concordance........................................................ 157 Luis Meneses, Carlos Monroy, Richard Furuta, Enrique Mallen Exploring the Biography and Artworks of Picasso with Interactive Calendars and Timelines ......................................................................................................... 160 Luis Meneses, Richard Furuta, Enrique Mallen Computer Assisted Conceptual Analysis of Text: the Concept of Mind in the Collected Papers of C.S. Peirce ............................................... 163 Jean-Guy Meunier, Dominic Forest Topic Maps and Entity Authority Records: an Effective Cyber Infrastructure for Digital Humanities.................................................... 166 Jamie Norrish, Alison Stevenson 2D and 3D Visualization of Stance in Popular Fiction .......................................................... 169 Lisa Lena Opas-Hänninen,Tapio Seppänen, Mari Karsikas, Suvi Tiinanen TEI Analytics: a TEI Format for Cross-collection Text Analysis ........................................... 171 Stephen Ramsay, Brian Pytlik-Zillig _____________________________________________________________________________ X Digital Humanities 2008 _____________________________________________________________________________ The Dictionary of Words in the Wild ..................................................................................... 173 Geoffrey Rockwell,Willard McCarty, Eleni Pantou-Kikkou The TTC-Atenea System: Researching Opportunities in the Field of Art-theoretical Terminology ............................ 176 Nuria Rodríguez Re-Engineering the Tree of Knowledge: Vector Space Analysis and Centroid-Based Clustering in the Encyclopédie ..................... 179 Glenn Roe, Robert Voyer, Russell Horton, Charles Cooney, Mark Olsen, Robert Morrissey Assumptions, Statistical Tests, and Non-traditional Authorship Attribution Studies -- Part II ............................................ 181 Joseph Rudman Does Size Matter? A Re-examination of a Time-proven Method ........................................ 184 Jan Rybicki The TEI as Luminol: Forensic Philology in a Digital Age ...................................................... 185 Stephanie Schlitz A Multi-version Wiki ................................................................................................................ 187 Desmond Schmidt, Nicoletta Brocca, Domenico Fiormonte Determining Value for Digital Humanities Tools................................................................... 
189 Susan Schreibman, Ann Hanlon Recent work in the EDUCE Project ....................................................................................... 190 W. Brent Seales, A. Ross Scaife “It’s a team if you use ‘reply all’”: An Exploration of Research Teams in Digital Humanities Environments .......................... 193 Lynne Siemens Doing Digital Scholarship ........................................................................................................ 194 Lisa Spiro Extracting author-specific expressions using random forest for use in the sociolinguistic analysis of political speeches ................................................................................................... 196 Takafumi Suzuki Gentleman in Dickens: A Multivariate Stylometric Approach to its Collocation .............. 199 Tomoji Tabata Video Game Avatar: From Other to Self-Transcendence and Transformation ................. 202 Mary L.Tripp Normalizing Identity:The Role of Blogging Software in Creating Digital Identity ........... 204 Kirsten Carol Uszkalo, Darren James Harkness A Modest proposal. Analysis of Specific Needs with Reference to Collation in Electronic Editions ................. 206 Ron Van den Branden _____________________________________________________________________________ XI Digital Humanities 2008 _____________________________________________________________________________ (Re)Writing the History of Humanities Computing ............................................................ 208 Edward Vanhoutte Knowledge-Based Information Systems in Research of Regional History ......................... 210 Aleksey Varfolomeyev, Henrihs Soms, Aleksandrs Ivanovs Using Wmatrix to investigate the narrators and characters of Julian Barnes’ Talking It Over .......................................................................................................................... 212 Brian David Walker The Chymistry of Isaac Newton and the Chymical Foundations of Digital Humanities .. 214 John A.Walsh Document-Centric Framework for Navigating Texts Online, or, the Intersection of the Text Encoding Initiative and the Metadata Encoding and Transmission Standard . 216 John A.Walsh, Michelle Dalmau iTrench: A Study of the Use of Information Technology in Field Archaeology ................... 218 Claire Warwick, Melissa Terras, Claire Fisher LogiLogi: A Webplatform for Philosophers ............................................................................ 221 Wybo Wiersma, Bruno Sarlo Clinical Applications of Computer-assisted Textual Analysis: a Tei Dream?........................ 223 Marco Zanasi, Daniele Silvi, Sergio Pizziconi, Giulia Musolino Automatic Link-Detection in Encoded Archival Descriptions ............................................ 226 Junte Zhang, Khairun Nisa Fachry, Jaap Kamps A Chinese Version of an Authorship Attribution Analysis Program .................................... 229 Mengjia Zhao, Patrick Juola Posters The German Hamlets: An Advanced Text Technological Application ................................. 233 Benjamin Birkenhake, Andreas Witt Fine Rolls in Print and on the Web: A Reader Study ............................................................ 235 Arianna Ciula,Tamara Lopez PhiloMine: An Integrated Environment for Humanities Text Mining .................................. 
237 Charles Cooney, Russell Horton, Mark Olsen, Glenn Roe, Robert Voyer The Music Information Retrieval Evaluation eXchange (MIREX): Community-Led Formal Evaluations ..................................................................................... 239 J. Stephen Downie, Andreas F. Ehmann, Jin Ha Lee Online Collaborative Research with REKn and PReE .......................................................... 241 Michael Elkink, Ray Siemens, Karin Armstrong A Computerized Corpus of Karelian Dialects: Design and Implementation ..................... 243 Dmitri Evmenov _____________________________________________________________________________ XII Digital Humanities 2008 _____________________________________________________________________________ Digitally Deducing Information about the Early American History with Semantic Web Techniques ................................................................................................................................ 244 Ismail Fahmi, Peter Scholing, Junte Zhang TauRo - A search and advanced management system for XML documents...................... 246 Alida Isolani, Paolo Ferragina, Dianella Lombardini,Tommaso Schiavinotto The Orationes Project ............................................................................................................. 249 Anthony Johnson, Lisa Lena Opas-Hänninen, Jyri Vaahtera JGAAP3.0 -- Authorship Attribution for the Rest of Us ....................................................... 250 Patrick Juola, John Noecker, Mike Ryan, Mengjia Zhao Translation Studies and XML: Biblical Translations in Byzantine Judaism, a Case Study . 252 Julia G. Krivoruchko, Eleonora Litta Modignani Picozzi, Elena Pierazzo Multi-Dimensional Markup: N-way relations as a generalisation over possible relations between annotation layers ...................................................................................................... 254 Harald Lüngen, Andreas Witt Bibliotheca Morimundi.Virtual reconstitution of the manuscript library of the abbey of Morimondo ............................................................................................................................... 256 Ernesto Mainoldi Investigating word co-occurrence selection with extracted sub-networks of the Gospels258 Maki Miyake Literary Space: A New Integrated Database System for Humanities Research ............... 260 Kiyoko Myojo, Shin’ichiro Sugo A Collaboration System for the Philology of the Buddhist Study ...................................... 262 Kiyonori Nagasaki Information Technology and the Humanities: The Experience of the Irish in Europe Project ..................................................................... 264 Thomas O’Connor, Mary Ann Lyons, John G. Keating Shakespeare on the tree.......................................................................................................... 266 Giuliano Pascucci XSLT (2.0) handbook for multiple hierarchies processing .................................................. 268 Elena Pierazzo, Raffaele Viglianti The Russian Folk Religious Imagination ................................................................................ 271 Jeanmarie Rouhier-Willoughby, Mark Lauersdorf, Dorothy Porter Using and Extending FRBR for the Digital Library for the Enlightenment and the Romantic Period – The Spanish Novel (DLER-SN) ........................ 272 Ana Rueda, Mark Richard Lauersdorf, Dorothy Carr Porter How to convert paper archives into a digital data base? 
Problems and solutions in the case of the Morphology Archives of Finnish Dialects ....... 275 Mari Siiroinen, Mikko Virtanen,Tatiana Stepanova _____________________________________________________________________________ XIII Digital Humanities 2008 _____________________________________________________________________________ A Bibliographic Utility for Digital Humanities Projects....................................................... 276 James Stout, Clifford Wulfman, Elli Mylonas A Digital Chronology of the Fashion, Dress and Behavior from Meiji to early Showa periods(1868-1945) in Japan .................................................................................................... 278 Haruko Takahashi Knight’s Quest: A Video Game to Explore Gender and Culture in Chaucer’s England .... 280 Mary L.Tripp,Thomas Rudy McDaniel, Natalie Underberg, Karla Kitalong, Steve Fiore TEI by Example: Pedagogical Approaches Used in the Construction of Online Digital Humanities Tutorials ................................................................................................................ 282 Ron Van den Branden, Melissa Terras, Edward Vanhoutte CLARIN: Common Language Resources and Technology Infrastructure .......................... 283 Martin Wynne,Tamás Váradi, Peter Wittenburg, Steven Krauwer, Kimmo Koskenniemi Fortune Hunting: Art, Archive, Appropriation ....................................................................... 285 Lisa Young, James Stout, Elli Mylonas Index of Authors ....................................................................................................................... 287 _____________________________________________________________________________ XIV Digital Humanities 2008 _____________________________________________________________________________ _____________________________________________________________________________ XV Digital Humanities 2008 _____________________________________________________________________________ _____________________________________________________________________________ XVI Digital Humanities 2008 _____________________________________________________________________________ Service Oriented Computing in the Humanities 3 (SOCH3) One-Day Workshop at Digital Humanities 2008 Tuesday, 24 June, 2008 9:30 – 17:30 Organized jointly by King’s College London, and the University of Oulu The workshop is organized by Stuart Dunn (Centre for eResearch, King’s College), Nicolas Gold (Computing Science, King’s College), Lorna Hughes (Centre for e-Research, King’s College), Lisa Lena Opas-Hänninen (English Philology, University of Oulu) and Tapio Seppänen (Information Engineering, University of Oulu). Speakers John Bradley john.bradley@kcl.ac.uk King’s College London, UK Elpiniki Fragkouli elpiniki.fragkouli@kcl.ac.uk King’s College London, UK Since software services, and the innovative software architectures that they require, are becoming widespread in their use in the Digital Humanities, it is important to facilitate and encourage problem and solution sharing among different disciplines to avoid reinventing the wheel. This workshop will build on two previous Service Oriented Computing in the Humanities events held in 2006 and 2007(under the auspices of SOSERNET and the AHRC ICT Methods Network). The workshop is structured around four invited presentations from different humanities disciplines. These disciplines are concerned with (e.g.) archaeological data, textual data, the visual arts and historical information. 
The presentations will approach humanities data, and the various uses of it, from different perspectives at different stages in the research lifecycle. There will be reflection on the issues that arise at the conception of a research idea, through to data gathering, analysis, collaboration, and publication and dissemination. A further presentation from Computer Science will act as a ‘technical response’ to these papers, showcasing different tool types and how they can be applied to the kinds of data and methods discussed. The presentations will be interspersed with scheduled discussion sessions. The emphasis throughout is on collaboration, and what the humanities and computer science communities can learn from one another: do we have a common agenda, and how can we take this forward? The day will conclude with a moderated discussion that will seek to identify and frame that agenda, and form the basis of a report.

Allen Renear renear@uiuc.edu University of Illinois, USA
Monica Schraefel mc@ecs.soton.ac.uk University of Southampton, UK
Tapio Seppänen tapio.seppanen@oulu.fi University of Oulu, Finland

Panels

The Homer Multitext Project

Casey Dué casey@chs.harvard.edu University of Houston, USA
Mary Ebbott ebbott@chs.harvard.edu College of the Holy Cross, USA
A. Ross Scaife scaife@gmail.com University of Kentucky, USA
W. Brent Seales seales@netlab.uky.edu University of Kentucky, USA
Christopher Blackwell cwblackwell@gmail.com Furman University, USA
Neel Smith dnsmith.neel@gmail.com College of the Holy Cross, USA
Dorothy Carr Porter dporter@uky.edu University of Kentucky, USA
Ryan Baumann rfbaumann@gmail.com University of Kentucky, USA

The Homer Multitext Project (HMT) is a new edition that presents the Iliad and Odyssey within the historical framework of their oral and textual transmission. The project begins from the position that these works, although passed along to us as written sources, were originally composed orally over a long period of time. What one would usually call “variants” from a base text are in fact evidence of the system of oral composition in performance and exhibit the diversity to be expected from an oral composition. These variants are not well reflected in traditional modes of editing, which focus on the reconstruction of an original text. In the case of the works of Homer there is no “original text”. All textual variants need to be understood in the historical context, or contexts, in which they first came to be, and it is the intention of the HMT to make these changes visible both synchronically and diachronically. The first paper in our session is an introduction to these and other aspects of the HMT, presented by the editors of the project, Casey Dué and Mary Ebbott.
Dué and Ebbott will discuss the need for a digital Multitext of the works of Homer, the potential uses of such an edition as we envision it, and the challenges in building the Multitext. The works of Homer are known to us through a variety of primary source materials.Although most students and scholars of the classics access the texts through traditional editions, those editions are only representations of existing material. A major aim of the HMT is to make available texts from a variety of sources, including high-resolution digital images of those sources. Any material which attests to a reading of Homer – papyrus fragment, manuscript, or inscription in stone – can and will be included in the ultimate edition. The HMT has begun incorporating images of sources starting with three important manuscripts: the tenth-century Marcianus Graecus Z. 454 (= 822), the eleventh-century Marcianus Graecus Z. 453 (= 821), and the twelfth/thirteenth-century Marcianus Graecus Z. 458 (= 841). Marcianus Graecus Z. 454, commonly called the “Venetus A,” is arguably the most important surviving copy of the Iliad, containing not only the full text of the poem but also several layers of commentary, or scholia, that include many variant readings of the text. The second paper in our session, presented by project collaborators W. Brent Seales and A. Ross Scaife, is a report on work carried out in Spring 2007 at the Biblioteca Nazionale Marciana to digitize these manuscripts, focusing on 3D imaging and virtual flattening in an effort to render the text (especially the scholia) more legible. Once text and images are compiled, they need to be published and made available for scholarly use, and to the wider public. The third paper, presented by the technical editors Christopher Blackwell and Neel Smith, will outline the protocols and software developed and under development to support the publication of the Multitext. Using technology that takes advantage of the best available practices and open source standards that have been developed for digital publications in a variety of fields, the Homer Multitext will offer free access to a library of texts and images, a machine-interface to that library and its indices, and tools to allow readers to discover and engage with the Homeric tradition. References: The Homer Multitext Project, description at the Center for Hellenic Studies website: http://chs.harvard.edu/chs/homer_ multitext Homer and the Papyri, http://chs.harvard.edu/chs/homer___ the_papyri_introduction Haslam, M. “Homeric Papyri and Transmission of the Text” in I. Morris and B. Powell, eds., A New Companion to Homer. Leiden, 1997. West, M. L. Studies in the text and transmission of the Iliad. München: K.G. Saur 2001 Digital Images of Iliad Manuscripts from the Marciana Library, First Drafts @ Classics@, October 26, 2007: http://zeus.chsdc. org/chs/manuscript_images _____________________________________________________________________________ 5 Digital Humanities 2008 _____________________________________________________________________________ Paper 1:The Homer Multitext Project: An Introduction Casey Dué and Mary Ebbott The Homer Multitext (HMT) of Harvard University’s Center for Hellenic Studies (CHS) in Washington, D.C., seeks to take advantage of the digital medium to give a more accurate visual representation of the textual tradition of Homeric epic than is possible on the printed page. 
Most significantly, we intend to reveal more readily the oral performance tradition in which the epics were composed, a tradition in which variation from performance to performance was natural and expected. The Homeric epics were composed again and again in performance: the digital medium, which can more readily handle multiple texts, is therefore eminently suitable for a critical edition of Homeric poetry—indeed, the fullest realization of a critical edition of Homer may require a digital medium. In other words, the careful construction of a digital library of Homeric texts, plus tools and accompanying material to assist in interpreting those texts, can help us to recover and better realize evidence of what preceded the written record of these epics. In this presentation we will explain how the oral, traditional poetry of these epics and their textual history call for a different editorial practice from what has previously been done and therefore require a new means of representation. We will also demonstrate how the Homer Multitext shares features with but also has different needs and goals from other digital editions of literary works that were composed in writing. Finally, we will discuss our goal of making these texts openly available so that anyone can use them for any research purpose, including purposes that we can’t ourselves anticipate. The oral traditional nature of Homeric poetry makes these texts that have survived different in important ways from texts that were composed in writing, even those that also survive in multiple and differing manuscripts. We have learned from the comparative fieldwork of Milman Parry and Albert Lord, who studied and recorded a living oral tradition during the 1930s and again in the 1950s in what was then Yugoslavia, that the Homeric epics were composed in performance during a long oral tradition that preceded any written version (Parry 1971, Lord 1960, Lord 1991, and Lord 1995; see also Nagy 1996a and Nagy 2002). In this tradition, the singer did not memorize a static text prior to performance, but would compose the song as he sang it. One of the most important revelations of the fieldwork of Parry and Lord is that every time the song is performed in an oral composition-in-performance tradition, it is composed anew. The singers themselves do not strive to innovate, but they nevertheless compose a new song each time (Dué 2002, 83-89). The mood of the audience or occasion of performance are just two factors that can influence the length of a song or a singer’s choices between competing, but still traditional, elements of plot. The term “variant,” as employed by textual critics when evaluating witnesses to a text, is not appropriate for such a compositional process. Lord explained the difference this way: “the word multiform is more accurate than ‘variant,’ because it does not give preference or precedence to any one word or set of words to express an idea; instead it acknowledges that the idea may exist in several forms” (Lord 1995, 23). Our textual criticism of Homeric epic, then, needs to distinguish what may genuinely be copying mistakes and what are performance multiforms: that is, what variations we see are very likely to be part of the system and the tradition in which these epics were composed (Dué 2001). Once we begin to think about the variations as parts of the system rather than as mistakes or corruptions, textual criticism of the Homeric texts can then address fresh questions. 
Some of the variations we see in the written record, for example, reveal the flexibility of this system. Where different written versions record different words, but each phrase or line is metrically and contextually sound, we must not necessarily consider one “correct” or “Homer” and the other a “mistake” or an “interpolation.” Rather, each could represent a different performance possibility, a choice that the singer could make, and would be making rapidly without reference to a set text (in any sense of that word). Yet it is difficult to indicate the parity of these multiforms in a standard critical edition on the printed page. One version must be chosen for the text on the upper portion of the page, and the other recorded variations must be placed in an apparatus below, often in smaller text, a placement that necessarily gives the impression that these variations are incorrect or at least less important. Within a digital medium, however, the Homer Multitext will be able to show where such variations occur, indicate clearly which witnesses record them, and allow users to see them in an arrangement that more intuitively distinguishes them as performance multiforms. Thus a digital criticism—one that can more readily present parallel texts—enables a more comprehensive understanding of these epics. An approach to editing Homer that embraces the multiformity of both the performative and textual phases of the tradition—that is to say, a multitextual approach—can convey the complexity of the transmission of Homeric epic in a way that is simply impossible on the printed page. We know that the circumstances of the composition of the Homeric epics demand a new kind of scholarly edition, a new kind of representation. We believe that a digital criticism of the witnesses to the poetry provides a means to construct one. Now the mission is to envision and create the Multitext so that it is true to these standards. The framework and tools that will connect these texts and make them dynamically useful are, as Peter Robinson has frankly stated (Robinson 2004 and Robinson 2005), the challenge that lies before us, as it does for other digital scholarly editions. How shall we highlight multiforms so as to make them easy to find and compare? We cannot simply have a list of variants, as some digital editions have done, for this would take them once again out of their context within the textual transmission and in many ways repeat the false impression that a printed apparatus gives. Breaking away from a print model is, as others have observed (Dahlström 2000, Robinson 2004), not as easy as it might seem, and digital editions have generally not succeeded yet in doing _____________________________________________________________________________ 6 Digital Humanities 2008 _____________________________________________________________________________ so. How do we guide users through these many resources in a structured but not overly pre-determined or static way? Some working toward digital editions or other digital collections promote the idea of the reader or user as editor (Robinson 2004, Ulman 2006, the Vergil Project at http://vergil.classics. upenn.edu/project.html). 
Yet Dahlström (2000, section 4) warned that a “hypermedia database exhibiting all versions of a work, enabling the user to choose freely between them and to construct his or her ‘own’ version or edition, presupposes a most highly competent user, and puts a rather heavy burden on him or her.” This type of digital edition “threatens to bury the user deep among the mass of potential virtuality.” Especially because we do not want to limit the usefulness of this project to specialized Homeric scholars only, a balance between freedom of choice and structured guidance is important. We are not alone in grappling with the questions, and the need to recognize and represent the oral, traditional nature of Homeric poetry provides a special challenge in our pursuit of the answers. References Dahlström, M. “Drowning by Versions.” Human IT 4 (2000). Available on-line at http://hb.se/bhs/ith/4-00/md.htm Dué, C. “Achilles’ Golden Amphora in Aeschines’ Against Timarchus and the Afterlife of Oral Tradition.” Classical Philology 96 (2001): 33-47. –––. Homeric Variations on a Lament by Briseis. Lanham, Md.: Rowman and Littlefield Press, 2002: http://chs.harvard.edu/ publications.sec/online_print_books.ssp/casey_du_homeric_ variations/briseis_toc.tei.xml_1 Dué, C. and Ebbott, M. “‘As Many Homers As You Please’: an On-line Multitext of Homer, Classics@ 2 (2004), C. Blackwell, R. Scaife, edd., http://chs.harvard.edu/classicsat/issue_2/dueebbott_2004_all.html Foley, J. The Singer of Tales in Performance. Bloomington, 1995. –––. Homer’s Traditional Art. University Park, 1999. Greg, W. W. The Shakespeare First Folio: Its Bibliographical and Textual History. London, 1955. Monroy, C., Kochumman, R., Furuta, R., Urbina, E., Melgoza, E., and Goenka, A. “Visualization of Variants in Textual Collations to Anaylze the Evolution of Literary Works in the The Cervantes Project.” Proceedings of the 6th European Conference on Research and Advanced Technology for Digital Libraries, 2002, pp. 638-653. http://www.csdl.tamu.edu/cervantes/pubs/ ecdl2002.pdf Nagy, G. Poetry as Performance. Cambridge, 1996. –––. Homeric Questions. Austin, TX, 1996. –––. Plato’s Rhapsody and Homer’s Music:The Poetics of the Panathenaic Festival in Classical Athens. Cambridge, Mass., 2002. Parry, A. ed., The Making of Homeric Verse. Oxford, 1971. Porter, D. “Examples of Images in Text Editing.” Proceedings of the 19th Joint International Conference of the Association for Computers and the Humanities, and the Association for Literary and Linguistic Computing, at the University of Illinois, UrbanaChampaign, 2007. http://www.digitalhumanities.org/dh2007/ abstracts/xhtml.xq?id=250 Robinson, P. “Where We Are with Electronic Scholarly Editions, and Where We Want to Be.” Jahrbuch für Computerphilologie Online 1.1 (2005) at http:// computerphilologie.uni-muenchen.de/jg03/robinson.html January 2004. In print in Jahrbuch für Computerphilologie 2004, 123-143. –––. “Current Issues in Making Digital Editions of Medieval texts—or, Do Electronic Scholarly Editions have a Future?” Digital Medievalist 1.1 (2005). http://www.digitalmedievalist. org/article.cfm?RecID=6. Stringer, G. “An Introduction to the Donne Variorum and the John Donne Society.” Anglistik 10.1 (March 1999): 85–95. Available on-line at http://donnevariorum.tamu.edu/anglist/ anglist.pdf. Ulman, H. L. “Will the Real Edition Please Stand Out?: Negotiating Multi-Linear Narratives encoded in Electronic Textual Editions.” Proceedings of the <Code> Conference, 2006. 
http://www.units.muohio.edu/codeconference/papers/papers/Ulman-Code03.pdf

Kiernan, K. “Digital Facsimiles in Editing: Some Guidelines for Editors of Image-based Scholarly Editions.” Electronic Textual Editing, ed. Burnard, O’Keeffe, and Unsworth. New York, 2005. Preprint at: http://www.tei-c.org/Activities/ETE/Preview/kiernan.xml.

Lord, A. B. The Singer of Tales. Cambridge, Mass., 1960. 2nd rev. edition, 2000.

–––. Epic Singers and Oral Tradition. Ithaca, N.Y., 1991.

–––. The Singer Resumes the Tale. Ithaca, N.Y., 1995.

Paper 2: Imaging the Venetus A Manuscript for the Homer Multitext

Ryan Baumann, W. Brent Seales and A. Ross Scaife

In May, 2007, a collaborative group of librarians, scholars and technologists under the aegis of the Homer Multitext project (HMT) traveled to Venice, Italy to undertake the photography and 3D scanning of the Venetus A manuscript at the Biblioteca Nazionale Marciana. Manoscritto marciano Gr.Z.454(=822): Homer I. Ilias, or Venetus A, is a 10th century Byzantine manuscript, perhaps the most important surviving copy of the Iliad, and the one on which modern editions are primarily based. Its inclusion will have the utmost significance for the HMT edition. In addition to the Iliad text, Venetus A contains numerous layers of commentary, called scholia, which serve to describe particular aspects of the main text. The scholia are written in very small script (mostly illegible in the only available print facsimile (Comparetti 1901)), and their organization on the page appears as a patchwork of tiny interlinking texts (see figure 1). The scholia are what truly set this manuscript apart, and their inclusion is of the greatest importance for the HMT. For this reason, we included 3D scanning in our digitization plan, with the aim of providing digital flattening and restoration that may help render the text (especially the scholia) more legible than digital images on their own. In this presentation we will describe how we obtained both digital photographs and 3D information, and will show how we combined them to create new views of this manuscript.

[Figure 1: This slice from Venetus A shows the main text, along with several layers of scholia: interlinear, marginal (both left and right), and intermarginal.]

This particular manuscript presents an almost perfect case for the application of digital flattening in the context of an extremely important, highly visible work that is the basis for substantial and ongoing scholarship in the classics. In the thousand-plus years since its creation, the environment has taken its toll on the substrate, the parchment upon which the texts are written. Several different types of damage, including staining, water damage, and gelatination (degradation of the fiber structure of the parchment), affect different areas of the manuscript. Much more widespread, affecting each folio in the manuscript and interfering with the reading of the text, is the planarity distortion, or cockling, that permeates the manuscript. Caused by moisture, cockling causes the folios to wrinkle and buckle, rather than to lie flat (see figure 2). This damage has the potential to present difficulties for the scholars by obstructing the text, especially the scholia.
Virtual flattening provides us with a method for flattening the pages, making the text easier to read, without physically forcing the parchment and potentially damaging the manuscript.

[Figure 2: Cockling is visible along the fore edge of the Venetus A]

Digital Flattening

The procedure for digital flattening involves obtaining a 3D mesh of the object (a model of the shape of the folio), in addition to the 2D photograph. The photographic texture information can then be mapped on to the 3D mesh, and the 3D mesh can be deformed (or unwarped) to its original flat 2D shape. As the associated texture is deformed along with it, the original 2D texture can be recovered. For the digital flattening of the Venetus A, the project used the physical-based modeling approaches established in (Brown 2001, Brown 2004). The largest previous published application of this digital flattening technique was to BL Cotton Otho B. x, a significantly more damaged 11th century manuscript consisting of 67 folios (Kiernan 2002, Brown 2004). By contrast, Venetus A consists of 327 folios, a nearly five-fold increase. Thus, a system was needed which could apply the necessary algorithms for flattening without manual intervention, in an automated fashion. We also required a reliable system for imaging the document, acquiring 3D scans, and obtaining calibration data.

Imaging: 2D and 3D

Digital photographs were taken using a Hasselblad H1 with a Phase One P45 digital back, capable of a resolution of 5428x7230 (39.2 megapixels). The manuscript was held by a custom cradle built by Manfred Mayer from the University of Graz in Austria. The cradle enabled the photography team to hold the book open at a fixed angle and minimize stress on the binding, and consisted of a motorized cradle and camera gantry perpendicular to the page being imaged. The camera was operated via FireWire, to reduce the need for manual adjustments. To acquire data for building the 3D mesh, we used a FARO 3D Laser ScanArm. Since we were unlikely to be able to use a flat surface near the cradle to mount the arm, we set the arm on a tripod with a custom mount. The ScanArm consists of a series of articulating joints, a point probe, and a laser line probe device attachment. The system operates by calibrating the point probe to the arm, such that for any given orientation of joints, the arm can reliably report the probe's position in 3D space. Another calibration procedure calibrates the point probe to the laser line probe. The result is a non-contact 3D data acquisition system which can obtain an accurate, high-resolution untextured point cloud.
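To make the digital flattening described above (under "Digital Flattening") concrete, the following short sketch is purely illustrative: it is not the project's code, and the function and variable names are hypothetical. It shows the two ingredients such a pipeline needs once the photograph and the meshed point cloud are in hand: a camera projection that tells where each mesh vertex appears in the photograph, and a relaxation that presses the mesh into the plane while approximately preserving edge lengths, standing in loosely for the physically based modelling of Brown and Seales cited above.

    import numpy as np

    def project_to_image(verts_3d, K, R, t):
        # Pinhole projection x ~ K(RX + t): returns (N, 2) pixel coordinates,
        # i.e. where each 3D mesh vertex appears in the photograph.
        cam = R @ verts_3d.T + t.reshape(3, 1)
        pix = K @ cam
        return (pix[:2] / pix[2]).T

    def flatten_mesh(verts_3d, edges, iters=2000, step=0.1):
        # Iteratively nudge a 2D copy of the mesh until its edge lengths match
        # the lengths measured on the 3D surface: a crude "unwarping" of the page.
        rest = {(i, j): np.linalg.norm(verts_3d[i] - verts_3d[j]) for i, j in edges}
        flat = verts_3d[:, :2].copy()
        for _ in range(iters):
            forces = np.zeros_like(flat)
            for (i, j), length in rest.items():
                d = flat[j] - flat[i]
                dist = np.linalg.norm(d) + 1e-9
                f = (dist - length) * d / dist
                forces[i] += f
                forces[j] -= f
            flat += step * forces
        return flat

The flattened 2D vertex positions, paired with each vertex's pixel coordinates from the projection, define a piecewise warp that can resample the photograph into a "flattened" image of the folio.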
Imaging Procedure

As the document required a minimum of 654 data sets to be fully imaged (verso and recto of each folio), data acquisition was split into a number of sessions. Each session typically represented a 4 hour block of work imaging a sequence of either recto or verso, with two sessions per day. Fifteen sessions were required to image the entire manuscript. Due to the size and fragility of the codex, it was moved as seldom as possible. Thus, each session was bookended by a set of calibration procedures before and after placing the manuscript in the cradle. In order for the extrinsic calibration to be accurate, neither the FARO nor the cameras could be moved during a session.

The process for a typical page was thus:
• Position the page and color calibration strips, then turn on the vacuum to suction the page down
• Open the barn door lights on both sides of the cradle
• Photograph the page, via USB control
• Close lights to minimize damage to the manuscript
• Scan the page with the laser line probe on the ScanArm (due to the small width of the line, this last step usually took 2-5 minutes)

Periodic spot-checks of the calibration data were carried out to ensure that good data was being obtained. Test flattenings were also performed to check that everything worked as expected.

Relationship between 2D and 3D

As the 2D and 3D data acquisition systems are completely separate (unlike the tightly coupled systems of (Brown 2001, Brown 2004, Brown 2007)), a calibration procedure was required to determine the correspondence between 3D points in the arm and 2D points in the camera. By calibrating the camera and ScanArm in relation to one another, we were able to determine the location of any point in the three dimensional space within view of the camera and to flatten the folios within that space. As we were unsure of the amount of control we would have over the primary photography, we used our own camera (a Canon EOS 20D, which was attached to the camera gantry immediately beside the Hasselblad) to obtain a set of control images for each page and calibration data for each session. The Canon was set to manual focus and a fixed focus and zoom, in order to have intrinsic camera calibration which would remain relatively stable. As the goal of the Hasselblad photography was to obtain the highest quality photographs of each page, it did not retain the same focal length for every page and would instead be focused on the current page depth. As a result, intrinsics (including focal length, distortion matrix, and center of radial distortion) were acquired using the average focus that the camera would be at during a session.

Once all this information was obtained – photographs, 3D information, and calibration – we were able to map the photographs to the 3D mesh and unwrap it to its 2D shape, helping to make legible texts that may otherwise be difficult to read.

The publication of legible, high-resolution digital photographs of the Venetus A manuscript is the first step towards an electronic edition of the text, scholia, and translation which will constitute a resource with far-reaching implications. Students and scholars of Greek literature, mythology, and palaeography could study in minute detail the most important manuscript of the Iliad. This codex has been seen by only a handful of people since its discovery and publication by Villoison in 1788 (Villoison 1788). Although Comparetti published a facsimile edition in 1901 (Comparetti 1901), very few copies of even this edition are available, even in research libraries, and the smallest writing in Comparetti's facsimile is illegible. The Homer MultiText, with its high-resolution digital images linked to full text and translation, will make this famous manuscript freely accessible to interested scholars (and others) worldwide.

Bibliography:

Baumann, R. Imaging and Flattening of the Venetus A. Master's Thesis, University of Kentucky, 2007.

Brown, M. S. and Seales, W. B. "Document Restoration Using 3D Shape: A General Deskewing Algorithm for Arbitrarily Warped Documents." ICCV 02 (2001): 367-375.
“Image Restoration of Arbitrarily Warped Documents.” IEEE Transactions on Pattern Analysis and Machine Intelligence 26: 10 (2004): 1295-1306.

Brown, M. S., Mingxuan, S., Yang, R., Yun, L., and Seales, W. B. “Restoring 2D Content from Distorted Documents.” IEEE Transactions on Pattern Analysis and Machine Intelligence 29: 11 (2007): 1904-1916.

Comparetti, D. Homeri Ilias Cum Scholiis. Codex Venetus A, Marcianus 454. Phototypice editus, praefatus est Dominicus Comparetti. Lugduni Batavorum: A. W. Sijthoff, 1901.

Kiernan, K., Seales, W. B., and Griffioen, J. “The Reappearances of St. Basil the Great in British Library MS Cotton Otho B. x.” Computers and the Humanities 36: 1 (2002): 7-26.

Lin, Yun. Physically-based Digital Restoration Using Volumetric Scanning. PhD Dissertation, University of Kentucky, 2007.

De Villoison, J. B. G., ed. Homeri Ilias ad veteris codicis Veneti fidem recensita. Venice, 1788.

Paper 3: The Homer Multitext: Infrastructure and Applications

Christopher Blackwell and Neel Smith

The Homer Multitext (HMT) seeks to create a library of materials documenting the history of the Homeric tradition, all of which will be freely available online. To support this aim, Neel Smith and Christopher Blackwell, in collaboration with other scholars, librarians, and technologists, have spent five years defining a set of protocols for a distributed digital library of “scholarly primitives”. The protocols aim to be as technologically agnostic as possible and to afford automatic discovery of services, the materials they serve, and their internal structures and citation schemes. In winter 2007, the initial contents of the Homer Multitext were published online using the first generation of web-based applications built atop implementations of these protocols. In this presentation we will describe and demonstrate these protocols and discuss how they contribute to the vision of the Homer Multitext.

The TICI Stack

The technical editors of the Center for Hellenic Studies call the set of technologies that implement these protocols the “TICI Stack”, with TICI being an acronym for the four categories of materials offered by the Homer Multitext Library: Texts, Images, Collections, and Indices.

Texts

The Texts components of the HMT are electronic editions and translations of ancient works: transcriptions of specific manuscripts of the Iliad and Odyssey, editions of Homeric fragments on papyrus, editions of ancient works that include Homeric quotations, and texts of ancient commentaries on Homeric poetry. These texts are marked up in TEI-conformant XML, using very minimal markup. Editorial information is included in the markup where appropriate, following the EpiDoc standard. More complex issues that might be handled by internal markup (complex scholarly apparatus, for example, or cross-referencing) are not included; these matters are handled by means of stand-off markup, in indices or collections. Access to texts is via implementations of the Canonical Text Services Protocol (CTS), which is based on the Functional Requirements for Bibliographic Records (FRBR). CTS extends FRBR, however, by providing hierarchical access from the broadest bibliographic level (“textgroup”, or “author”), down through “work”, “edition/translation”, and into the contents of the text itself.
The CTS protocol can point to any citeable unit of a text, or range of such units, and even more precisely to a single character. For example, the CTS URN

urn:cts:tlg0012.tlg001:chsA:10.1

points to Homer (tlg0012), the Iliad (tlg001), the edition catalogued as “chsA” (in this case a transcription of the text that appears on the manuscript Marcianus Graecus Z. 454 [= 822]), Book 10, Line 1. A CTS request, handed this URN, would return:

Ἄλλοι μὲν παρὰ νηυσίν ἀριστῆες Παναχαιῶν

A more specific URN,

urn:cts:tlg0012.tlg001:chsA:10.1:ρ[1]

would point to the second Greek letter “rho” in this edition of Iliad 10.1, which appears in the word ἀριστῆες.

Images

The HMT currently offers over 2000 high-resolution images of the folios of three Homeric manuscripts, captured in the spring of 2007 at the Biblioteca Nazionale Marciana in Venice. These images are accessible online through a basic directory listing, but also through a web-based application, described below. To relate these images with bibliographic data, technical metadata, and associated passages of texts, the HMT relies on Collections and Indices.

Collections

A collection is a group of like objects, in no particular order. For the CTS, one example is a Collection of images, with each element in the collection represented by an XML fragment that records an id, pointing to a digital image file, and metadata for that image. Another example is a Collection of manuscript folios. Each member of this Collection includes the name of the manuscript, the folio’s enumeration and side (“12-recto”), and one or more CTS URNs that identify what text appears on the folio. Collections are exposed via the Collection Service, which allows discovery of the fields of a particular collection, querying on those fields, and retrieval of XML fragments. A final example would be a lexicon of Homeric Greek, represented as a collection.

Indices

Indices are simple relations between two elements, a reference and a value. They do much of the work of binding together TICI applications.

The Manuscript Browser

The first data published by the HMT Library were the images of manuscripts from Venice. These are exposed via the Manuscript Browser application, the first published application built on the TICI Stack: http://chs75.harvard.edu/manuscripts

This application allows users to browse images based either on the manuscript and folio number, or (more usefully) by asking for a particular manuscript, and then for a particular book and line of the Iliad. Users are then taken to a page for that folio. Online interaction with the high-resolution images is through the Google Maps API, an AJAX implementation that allows very responsive panning and zooming without requiring users to download large files. From the default view, users can select any other views of that page (details, ultraviolet photographs, etc.), and can choose to download one of four versions of the image they are viewing: an uncompressed TIFF, or one of three sizes of JPEG. This application draws on a CTS service, two indices, and a collection of images. It queries a CTS service to retrieve bibliographic information about manuscripts whose contents are online (the CTS protocol is not limited to electronic texts, but can deliver information about printed editions as well).
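As an illustration of how the hierarchical URN scheme above can be consumed by client code, the following is a minimal sketch of a parser for URNs of the shape used in these examples. It is not code from the HMT and not a complete implementation of the CTS specification; it simply splits the colon-separated fields (textgroup.work, edition, passage, optional subreference) described above.

# A minimal, illustrative parser for CTS URNs of the shape used in this abstract,
# e.g. urn:cts:tlg0012.tlg001:chsA:10.1 or ...:10.1:ρ[1]. Not HMT code and not a
# full CTS implementation; it only splits the fields described in the text above.

from dataclasses import dataclass
from typing import Optional

@dataclass
class CtsUrn:
    textgroup: str                       # e.g. "tlg0012" (Homer)
    work: str                            # e.g. "tlg001"  (Iliad)
    edition: str                         # e.g. "chsA"
    passage: str                         # e.g. "10.1"    (Book 10, Line 1)
    subreference: Optional[str] = None   # e.g. "ρ[1]" for a single character

def parse_cts_urn(urn: str) -> CtsUrn:
    parts = urn.split(":")
    if parts[:2] != ["urn", "cts"] or len(parts) < 5:
        raise ValueError(f"not a CTS URN of the expected shape: {urn}")
    textgroup, work = parts[2].split(".", 1)
    return CtsUrn(textgroup, work, parts[3], parts[4],
                  parts[5] if len(parts) > 5 else None)

if __name__ == "__main__":
    print(parse_cts_urn("urn:cts:tlg0012.tlg001:chsA:10.1"))
    print(parse_cts_urn("urn:cts:tlg0012.tlg001:chsA:10.1:ρ[1]"))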
When a user requests a particular book and line of the Iliad, it queries one index whose data looks like this:

<record>
<ref>urn:cts:greekLit:tlg0012.tlg001:1.1</ref>
<value>msA-12r</value>
</record>

Here the <ref> element contains a URN that, in this case, points to Book 1, Line 1 of the Iliad, and the <value> element identifies folio 12, recto, of “manuscript A”. This allows the application to query a collection of manuscript folios and to determine which images are available for this folio.

The display of an image via the Google Maps API requires it to be rendered into many tiles--the number depends on the size and resolution of the image, and the degree of zooming desired. For the images from Venice, we produced over 3,000,000 image tiles. This work was made possible by the Google Maps Image Cutter software from the UCL Centre for Advanced Spatial Analysis, who kindly added batch-processing capability to their application at our request.

Next Steps

Currently under development, as of November 2007, is the TICI Reader, a web-based application to bring texts together with indexed information and collections. This is intended to coincide with the completion of six editions of the Iliad, electronic transcriptions of important medieval manuscripts of the Iliad, and a new electronic edition of the medieval scholarly notes, the scholia, that appear on these manuscripts.

Availability

All of the work of the HMT is intended to be open and accessible. The images and texts are licensed under a Creative Commons License, and all tools are based on open standards and implemented in open source technologies. Work is progressing, and more is planned, on translations of as many of these materials as possible, to ensure that they reach the widest possible audience.

Bibliography

Canonical Text Services Protocol: http://chs75.harvard.edu/projects/diginc/techpub/cts

Creative Commons: http://creativecommons.org/

EpiDoc: Epigraphic Documents in TEI XML: http://epidoc.sourceforge.net/

Google Maps API: http://www.google.com/apis/maps/

IFLA Study Group on the Functional Requirements of Bibliographic Records. “Functional Requirements of Bibliographic Records: final report.” München: K. G. Saur, 1998: http://www.ifla.org/VII/s13/frbr/frbr.pdf

TEI P4: Guidelines for Electronic Text Encoding and Interchange. Edited by C. M. Sperberg-McQueen and Lou Burnard. The TEI Consortium: 2001, 2002, 2004.

Defining an International Humanities Portal

Neil Fraistat
nfraistat@gmail.com
Maryland Institute for Technology in the Humanities, USA

Domenico Fiormonte
fiormont@uniroma3.it
Università Roma Tre, Italy

Ian Johnson
johnson@acl.arts.usyd.edu.au
University of Sydney, Australia

Jan Christoph Meister
jan-c-meister@uni-hamburg.de
Universität Hamburg, Germany

John Unsworth
unsworth@uiuc.edu
University of Illinois, Urbana-Champaign, USA

What would a portal for humanists look like? Who would it serve and what services would it provide? centerNet, an international network of digital humanities centers, proposes a panel to discuss an ongoing conversation about the need for an International Humanities Portal (IHP) to serve students and researchers. The panelists will present different perspectives on a number of questions we have posed to the larger network.
The panel will both present the panelists’ own positions on the case for a Portal and report back on the collaborative development of a common Needs Analysis that identifies the need for, and the potential scope of, such a portal. We propose a panel as a way to reflect back to the community the discussions so far and to solicit further input on the need and case for a common infrastructure. The participants bring an international perspective to the discussion that is central to what we imagine. Some of the questions the panelists will address are:

1. Who would be the users of a humanities portal? Would a portal serve only researchers or would it serve students and anyone else interested in the humanities? Would a portal serve primarily digital humanists or would it serve all humanists? The need for a portal starts with the definition of the audience and how a portal might serve their interests. As the title suggests and the participants in the panel represent, we have also started from the position that research crosses national and linguistic boundaries and that we should therefore imagine an international portal that can be multilingual and support the needs of international research.

2. Who might support such a portal? How might the development of such a portal be funded and how would it be maintained? centerNet includes centers with access to development and support resources, but no center has the capacity to support a successful international portal. We therefore imagine a distributed model that will draw on expertise and support from the centerNet community and beyond. We also imagine that the development of a consensus about the need and scope of a portal by centerNet could help individual centers to secure support from national and other funding bodies to develop and maintain components. Ultimately we hope the support needed at any one center will be easier to secure and easier to maintain if backed up by an articulated Needs Analysis from a larger body like centerNet.

3. What might a humanities portal look like? Will it actually be a portal or should we imagine providing services to existing university portals? Many universities are developing student and faculty portals, as are scholarly associations and projects. Portals now seem dated in light of Web 2.0 social technologies. For this reason we imagine that an IHP will need to interoperate with other portals through “portlets” or OpenSocial (code.google.com/apis/opensocial/) plug-ins that can be added to project sites. The IHP will itself need to play well with other resources rather than aim to organize them.

4. What services might it provide? What services are already provided by centers and projects? Can the portal provide an entry point into the richness of existing resources? The panelists will survey the variety of resources already available to their communities and discuss what new resources are needed. Above all, a portal should be a door into an area of inquiry -- panelists will reflect on what they and their community would expect to have easy access to from something that promised to be an International Humanities Portal.

5. Just as important as defining the services is imagining how services might interoperate. A humanist familiar with the web will know of resources from Voice of the Shuttle to Intute: Arts and Humanities, but can we imagine how these services might be brought together so that one can search across them? Can we imagine a mashup of services providing new and unanticipated possibilities?
6. What are the next steps? How can the case for an IHP be strengthened? How can centerNet help, not hinder, projects that want to develop and support components that meet the needs of the community? One of the purposes of this panel is to solicit feedback from the larger digital humanities community.

Portals are typically characterized by two features. First, their users can customize their account to show or hide particular resources. Thus users of the TAPoR portal, for example, can define texts and tools that they want to use and ignore the rest. The ease of customization and the depth of customization are important to users if they use portals regularly. Second, users expect that a portal provides access to the breadth of services and resources they need for a domain of inquiry. Thus students expect a student portal to provide access to all the online resources they need as a student, from the library account to their course information. A portal should be just that -- a door into a domain.

Is it possible for an IHP to be on the one hand easy to use (and customizable), and on the other hand truly provide broad access to resources? Is it possible that an IHP is too ambitious and would either be too complicated for humanists to use or not broad enough in scope to be useful? The answer in part lies in imagining an initial audience and conducting the usability studies needed to understand what they would expect. Thus we imagine an iterative process of designing for a first set of users and redesigning as we learn more. This means a long-term development commitment which is beyond most project funding. The challenges are enormous and we may be overtaken by commercial portals like Google or Yahoo that provide most of what humanists need. We believe, however, that the process of defining the opportunities and challenges is a way to recognize what is useful and move in a direction of collaboration. We hope that in the discussion modest first steps will emerge.

Bibliography

centerNet: http://www.digitalhumanities.org/centernet/

Detlor, Brian. Towards knowledge portals: From human issues to intelligent agents. Dordrecht, The Netherlands: Kluwer Academic Publishers, 2004.

Intute: Arts and Humanities: http://www.intute.ac.uk/artsandhumanities/

TAPoR: Text Analysis Portal for Research: http://portal.tapor.ca

VoS: Voice of the Shuttle: http://vos.ucsb.edu/

Beyond Search: Literary Studies and the Digital Library

Matthew Jockers
mjockers@stanford.edu
Stanford University, USA

Glen Worthey
gworthey@stanford.edu
Stanford University, USA

Joe Shapiro
jpshapir@stanford.edu
Stanford University, USA

Sarah Allison
sdalliso@stanford.edu
Stanford University, USA

The papers in this session are derived from the collaborative research being conducted in the “Beyond Search and Access: Literary Studies and the Digital Library” workshop at Stanford University. The workshop was set up to investigate the possibilities offered by large literary corpora and how recent advances in information technology, especially machine learning and data-mining, can offer new entry points for the study of literature.
The participants are primarily interested in the literary implications of this work and what computational approaches can or do reveal about the creative enterprise. The papers presented here share a common interest in making literary-historical arguments based on statistical evidence harvested from large text collections. While the papers are not, strictly speaking, about testing computational methods in a literary context, methodological evaluations are made constantly in the course of the work, and there is much to interest both the literary minded and the technically inclined.

Abstract 1: “The Good, the Bad and the Ugly: Corralling an Okay Text Corpus from a Whole Heap o’ Sources.” Glen Worthey

An increasingly important variety of digital humanities activity has recently arisen among Stanford’s computing humanists, an approach to textual analysis that Franco Moretti has called “distant reading” and Matthew Jockers has begun to formally develop under the name of macro-analysis. At its most simple, the approach involves making literary-historical arguments based on statistical evidence from “comprehensive” text corpora. One of the peculiarities of this activity is evident from the very beginning of any foray into it: it nearly always requires the creation of “comprehensive” literary corpora. This paper discusses the roles and challenges of the digital librarian in supporting this type of research, with specific emphasis on the curation (creation, licensing, gathering, and manipulation) of a statistically significant corpus. Because the arguments at stake are largely statistical, both the size of the “population” of texts under examination, and any inherent bias in the composition of the corpus, are potentially significant factors. Although the number of available electronic texts from which to draw a research corpus is increasing rapidly, thanks both to mass digitization efforts and to expanding commercial offerings of digital texts, the variability of these texts is likewise increasing: no longer can we count on working only with well-behaved (or even well-formed) TEI texts of canonical literary works; no longer can we count on the relatively error-free re-keyed, proofread, marked-up editions for which the humanities computing community has spent so much effort creating standards, and textual corpora that abide by those standards. In addition to full texts with XML (or SGML) markup that come to us from the model editions and exemplary projects that we have come to love (such as the Brown Women Writers Project), and the high-quality licensed resources of similar characteristics (such as Chadwyck-Healey collections), we are now additionally both blessed and cursed with the fruits of mass digitization (both by our own hands and by commercial and consortial efforts such as the Google Library Project and Open Content Alliance), which come, for the most part, as uncorrected (or very lightly corrected) OCR. Schreibman et al. (2008) have described University of Maryland efforts to combine disparate sources into a single archive. A similar effort to incorporate “messy data” (from Usenet texts) into usable corpora has been described in Hoffmann (2007); there is also, of course, a rich literature in general corpus design.
This paper will discuss the application of lessons from these other corpus-building projects to our own inclusion of an even more variant set of sources into a corpus suitable for this particular flavor of literary study. This digital librarian’s particular sympathies, like perhaps those of many or most who participate in ADHO meetings, lie clearly with what we might now call “good,” “old-fashioned,” TEI-based, full-text digital collections. We love the perfectionism of “true” full text, the intensity and thoughtfulness and possibility provided by rich markup, the care and reverence for the text that have always been hallmarks of scholarly editing, and that were embraced by the TEI community even for less formal digital editions. Now, though, in the days of “more” and “faster” digital text production, we are forced to admit that most of these less-than-perfect full-text collections and products offer significant advantages of scale and scope. While one certainly hesitates to claim that quantity actually trumps quality, it may be true that quantity can be transformed into quality of a sort different from what we’re used to striving for. At the same time, the “macro-analysis” and “distant reading” approaches demand corpora on scales that we in the “intensely curated” digital library had not really anticipated before. As in economics, it is perhaps difficult to say whether increased supply has influenced demand or vice versa, but regardless, we are certainly observing something of a confluence between mass approaches to the study of literary history and mass digitization of its textual evidence.

This paper examines, in a few case studies, the gathering and manipulation of existing digital library sources, and the creation of new digital texts to fill in gaps, with the aim of creating large corpora for several “distant reading” projects at Stanford. I will discuss the assessment of project requirements; the identification of potential sources; licensing issues; sharing of resources with other institutions; and the more technical issues around determining and obtaining “good-enough” text accuracy; “rich-enough” markup creation, transformation and normalization; and provision of access to the corpora in ways that don’t threaten more everyday access to our digital library resources. Finally, I’ll discuss some of the issues involved in presenting, archiving, and preserving the resources we’ve created for “special” projects so that they both fit appropriately into our “general” digital library, and will be useful for future research.

References

Sebastian Hoffmann: Processing Internet-derived Text—Creating a Corpus of Usenet Messages. Literary and Linguistic Computing Advance Access published on June 1, 2007, DOI 10.1093/llc/fqm002. Literary and Linguistic Computing 22: 151-165.

Susan Schreibman, Jennifer O’Brien Roper, and Gretchen Gueguen: Cross-collection Searching: A Pandora’s Box or the Holy Grail? Literary and Linguistic Computing Advance Access published on April 1, 2008, DOI 10.1093/llc/fqm039. Literary and Linguistic Computing 23: 13-25.

Abstract 2: “Narrate or Describe: Macro-analysis of the 19th-century American Novel” Joe Shapiro

Does the 19th-century American novel narrate, or does it describe? In 1936, Georg Lukács argued that over the course of the nineteenth century description replaced narration as the dominant mode of the European (and especially French) novel. What happens, we want to know, for the American novel?
We begin with a structuralist distinction between the logics of narration and description: narration as the articulation of the events that make up a story; description as the attribution of features to characters, places, and objects in the story. We then set out, with the help of computational linguistics, to identify the computable sentence-level stylistic “signals” of narration and description from a small training sample (roughly 2,000 sentences), and thus to develop a model that can classify narration and description in “the wild”—a corpus of over 800 American novels from 1789 to 1875. This paper will discuss how we have refined our research problem and developed our classifying model—trial-and-error processes both; our initial results in “the wild”; and finally how macro-analysis of this kind leads to new problems for literary history.

Existing scholarship suggests that “realist” description enters the American novel with the historical novel; thus our initial training set of samples was taken from 10 American historical novels from the 1820s, 1830s, and 1840s (by J. F. Cooper and his rivals). Participants in the Beyond Search workshop have tagged random selections from these 10 novels. The unit of selection is, for convenience, the chapter. Selected chapters are broken (tokenized) into individual sentences and human-tagged using a custom XML schema that allows for a “type” attribute for each sentence element. Possible values for the type attribute include “Description,” “Narration,” “Both,” “Speech,” and “Other.” Any disagreement about tagging the training set has been resolved via consensus. (Since the signals for description may change over time—indeed, no small problem for this study—we plan to add an additional training sample from later in the corpus.)

Using a maximum-entropy classifier we have begun to investigate the qualities of the evolving training set and to identify the stylistic “signals” that are unique to, or most prevalent in, narrative and descriptive sentences. In the case of description, for example, we find a marked presence of spatial prepositions, an above-average percentage of nouns and adjectives, a relatively low percentage of first- and second-person pronouns, above-average sentence lengths, and a high percentage of diverse words (greater lexical richness). From this initial work it has become clear, however, that our final model will need to include not simply word usage data, but also grammatical and lexical information, as well as contextualizing information (i.e., the kinds of sentence that precede and follow a given sentence, the sentence’s location in a paragraph). We are in the process of developing a model that makes use of part-of-speech sequences and syntactic tree structures, as well as contextualizing information. After a suitable training set has been completed and an accurate classifying model has been constructed, our intention is to “auto-tag” the entire corpus at the level of the sentence. Once the entire corpus has been tagged, a straightforward quantitative analysis of the relative frequency of sentence types within the corpus will follow. Here, the emphasis will be placed on a time-based evaluation of description as a feature of 19th-century American fiction.
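By way of illustration, the sketch below shows the general shape of such a sentence classifier: a maximum-entropy model (here scikit-learn's LogisticRegression) trained on a handful of hand-engineered features of the kind listed above. The sentences, labels, word lists, and feature choices are invented for illustration; they are not the workshop's data or code.

# A toy sketch of the sentence classification described above: a maximum-entropy
# model (scikit-learn's LogisticRegression) over a few of the hand-crafted
# features the abstract mentions (spatial prepositions, pronoun rate, sentence
# length, lexical richness). All training data below is invented.

from sklearn.linear_model import LogisticRegression

SPATIAL_PREPS = {"in", "on", "under", "above", "beside", "near", "behind"}
PRONOUNS_1_2 = {"i", "we", "you", "my", "our", "your"}

def features(sentence: str) -> list[float]:
    tokens = sentence.lower().replace(",", "").replace(".", "").split()
    n = len(tokens)
    return [
        sum(t in SPATIAL_PREPS for t in tokens) / n,   # spatial preposition rate
        sum(t in PRONOUNS_1_2 for t in tokens) / n,    # 1st/2nd person pronoun rate
        n,                                             # sentence length
        len(set(tokens)) / n,                          # lexical richness (type/token)
    ]

train = [
    ("The old fort stood on a low hill above the grey river.", "description"),
    ("A ruined mill lay near the edge of the dark wood.", "description"),
    ("He seized the letter and ran to the harbour.", "narration"),
    ("She answered him and they rode away before dawn.", "narration"),
]
X = [features(s) for s, _ in train]
y = [label for _, label in train]

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict([features("Tall pines grew beside the white farmhouse.")]))

In practice, as the abstract notes, such a model would also need part-of-speech, syntactic, and contextual features, and far more than a handful of training sentences.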
But then, if a pattern emerges, we will have to explain it—and strictly quantitative analysis will need to be supplemented by qualitative analysis, as we interrogate not just what mode is prevalent when, but what the modes might mean at any given time and how the modes themselves undergo mutation.

Abstract 3: “Tracking the ‘Voice of Doxa’ in the Victorian Novel.” Sarah Allison

The nineteenth-century British novel is known for its moralizing: is it possible to define this “voice of doxa,” or conventional wisdom, in terms of computable, sentence-level stylistic features? A familiar version of the voice of doxa is the “interrupting narrator,” who addresses the reader in the second person in order to clarify the meaning of the story. This project seeks to go beyond simple narrative interruption to the explication of ethical signals emitted in the process of characterization (how the portrayal of a character unfolds over the course of a novel). It also takes a more precise look at the shift in tense noted by narratologists from the past tense of the story to the present tense of the discourse, in which meaning can be elaborated in terms of proverbs or truisms. (An example from Middlemarch: “[Fred] had gone to his father and told him one vexatious affair, and he had left another untold: in such cases the complete revelation always produces the impression of a previous duplicity” 23). This project is the first attempt to generate a set of micro-stylistic features that indicate the presence of ethical judgment. Through an analysis of data derived through ad hoc harvesting of frequently occurring lexical and syntactic patterns (e.g. word frequencies, part-of-speech saturation, frequent grammatical patterns, etc.), and of data derived through the application of supervised classification algorithms, this research attempts to determine a set of grammatical features that tend to cluster around these moments of direct narrative discourse. This research is an alternative application of the method developed in Joe Shapiro’s Beyond Search workshop project (see abstract above), which seeks to identify computable stylistic differences between narrative and descriptive prose in 19th-century American fiction. In this work we seek to create another category within the “descriptive,” a subcategory that captures moments of explicitly moralized description: the Voice of Doxa. We identify the formal aspects of this authorial “voice” in order to “hunt” for similar moments, or occurrences, in a series of novels. The work begins with a limited search for patterns of characterization evident among characters in a single George Eliot novel; from this we develop a model, which we will then apply to the entire Eliot corpus. In the end, the target corpus is extended to include 250 19th-century British novels wherein we roughly chart the evolutionary course of the “ethical signal.”

Designing, Coding, and Playing with Fire: Innovations in Interdisciplinary Research Project Management

Stephen Ramsay
sramsay@unlserve.unl.edu
University of Nebraska at Lincoln, USA

Stéfan Sinclair
sgsinclair@gmail.com
McMaster University, Canada

John Unsworth
unsworth@uiuc.edu
University of Illinois, Urbana-Champaign, USA

Milena Radzikowska
mradzikowska@gmail.com
Mount Royal College, Canada

Stan Ruecker
sruecker@ualberta.ca
University of Alberta, Canada

From its inception, Digital Humanities has relied on interdisciplinary research. Nationally and internationally, our experience of doing research involving people from various disciplines has become increasingly important, not only because the results are often innovative, fascinating, and significant, but also because the management tools and practices we have been developing are potentially useful for a wide range of researchers. In this session, we will discuss three very different aspects of interdisciplinary project management, based on our combined experience of decades of collaboration.

The first paper, “Hackfests, Designfests, and Writingfests,” is an attempt to pull together a summary of the best practices we have been developing in the area of planning and carrying out face-to-face work sessions. Over the past three years, we have carried out a total of fourteen Hackfests, involving programmers, designers, and writers. In studying our own results, we have generated practical guidelines on topics such as the choice of locations, the assignment of tasks, and the handling of logistics.

The second paper, “Hackey,” provides a description of a new online tool that we have been developing for use by programmer/designer pairs, to encourage small but frequent activity on a project. It often happens that people make commitments, especially to virtual teams, that result in them working in large but infrequent blocks of time, as at the Hackfests, when it may be more productive in the long run if they are able to devote an hour a day instead. We’ve attempted to address this problem through the design of a game that uses the concept of volleying back and forth on a specific topic, all under the watchful eye of a benevolent third-party pit boss.

The final paper, entitled “Rules of the Order,” considers the sociology of large, multi-institutional software development projects. Most people would be able to cite the major obstacles standing in the way of such collaborations, like the lack of face-to-face communication. But even with the best methods of communication available, researchers still need to acknowledge the larger “sociology” of academia and develop ways of working that respect existing institutional boundaries, workflows, and methodologies.

Hackfests, Designfests, and Writingfests: The Role of Intense Periods of Face-to-Face Collaboration in International Research Teams

Stan Ruecker, Milena Radzikowska, and Stéfan Sinclair

Extended Abstract

As the authors of this paper have become increasingly active in international, interdisciplinary, collaborative research projects, we have progressively adopted a growing range of online tools to help manage the activities. We’ve used Basecamp for task assignment, Meeting Wizard for scheduling conference calls, project listserves for discussion, and research wikis for archiving the results. We have SVN for software versioning, and Jira for tracking and prioritizing programming tasks. Our websites provide a public face, with project rationales, contact information, and repositories for research presentations and articles. We also try whenever possible to provide online prototypes, so that our readers have the opportunity to try our interface experiments for themselves, with appropriate warnings that things might be held together on the back end with duct tape, and participant agreements for those occasions when we hope to capture log records of what they are doing.
All of the online tools we use contribute to our productivity. However, we have also become increasingly convinced of the benefits of periodic face-to-face working sessions, where we gather from the twelve-winded sky to focus our attention on the project. In the words of Scott Ambler, “have team, will travel.” The value of such face-to-face meetings is well understood by distance educators (e.g. Davis and Fill 2007), who see such focused time as a significant factor in student learning. We have been speculatively trying variations of our working sessions, such as choice of location, numbers of participants, types of work, and other logistical details like meals and entertainment. Over the past three years, we have held fourteen such events. They have involved a number of participants ranging from as few as two to as many as fifteen. Nine of the events were dedicated to writing and designing; three were used for programming and designing; one was devoted just to programming, and another to just designing. Our goal in this paper is to provide the insights we’ve gained from these variations for the benefit of other researchers who are interested in trying their own project-oriented work fests.

Types of Work

It is important that these not be meetings. No one expects to get work done at a meeting, not in the sense of writing a paper, designing a system, or programming, so the expectations are at odds with the goal of the exercise. Meetings typically carry a lot of connotative baggage, not least of which is inefficiency and frustration, as expressed by the educational videos with John Cleese entitled “Meetings, Bloody Meetings.” Humanists are rarely trained in conducting meetings efficiently and productively, and it shows. By emphasizing to all the participants that the goal is to work productively together in the same room, we can hit the ground running and get a lot accomplished in a short period of time.

We have in almost every case combined people who are working on different kinds of tasks. In the two instances where we had only a single task (programming and designing, respectively) we had a sense of accomplishing much less. It is difficult in some ways to verify this perception, but the energy levels of participants in both cases seemed lower. Our interpretation would be, not that there is competition per se between different disciplines, but that it is harder to judge how much another disciplinary group is accomplishing, and that uncertainty helps to keep people motivated.

Participants

In general, the participants on our research projects know what they are doing. In fact, many of them are eager to get going on tasks that have fallen behind due to the other demands on their time. So in general, it has not been necessary to provide a large administrative overhead in terms of controlling and monitoring their activities during the event. Given the nature of interdisciplinary collaboration, too much attempt to control is counter-productive in any case. However, the nature of task assignment during a hackfest is a matter that may require some finesse. We have made errors where we had someone working on an assignment that was not the kind of assignment they could tackle with enthusiasm. In these cases, a little preliminary consultation with the individual participant would have made a big improvement.
This consultation could have occurred either before the hackfest or even upon arrival, but it was a misstep not to have it early. In terms of numbers of participants, it appears that you need enough people to generate momentum. Two or three is not enough, although four seems to be. Our largest group so far has been fifteen, which is somewhat large to co-ordinate, and they end up working anyway in smaller groups of 2-5, but a larger group also makes the event more dynamic and can provide greater opportunities for cross-disciplinary consultation.

Location, Location, Location

There are many temptations to locate the events in places that will be convenient. We have found that in this instance it is better to avoid temptation if possible. For example, if there are five participants and three are from the same city, why not hold the event in that city? It reduces costs, in particular for travel and accommodations, and the participants can also serve as hosts for their out-of-town colleagues. In practice, however, we have found that holding a hackfest in some of the participants’ hometown lowers the hackfest’s productivity level. The biggest obstacle is the local demands on the attention of the resident participants. We have found that local participants are constantly on call for work and domestic commitments, which means they tend to absent themselves from the event. One local host for a comparatively large number of participants is less problematic, since the periodic absence of one participant is not as critical a factor in maintaining the momentum of the entire group, but for small groups it can be significant. An alternative is therefore to transport the participants to a pleasant working location, provided that the attractions of the location are not so powerful as to be distracting. We are all familiar with conferences in exotic locations where participants have difficulty focusing on the conference because the beach is calling. Given this mixture of considerations, our current guideline is therefore to use a location no closer than an hour’s travel from some cluster of the participants, so that both the overall travel costs and local demands for time can be reduced.

Our hackfests are set up according to a flat organizational structure. All participants are treated as equals and decisions are, as much as possible, based on consensus. Even when planning a hackfest, we consult with all invited participants on potential locations, duration, and objectives. In rare cases where leadership is required during the event, it is clear to everyone that the overriding structure is based on equality. For mid-sized groups, we find that it is useful to have a “rover” or two, who constantly moves between the different groups and tries to identify potential problems and opportunities for bridging group efforts. A further point related to participants is, whenever possible, to consider the past, present, and future of the project. It is too easy to focus on what needs to be done immediately and to only include those who are currently active in the project. However, we have found that including previous participants – even if they are not currently active on the project – helps to remind the team of some of the institutional history, providing some context for past decisions and the current situation. Similarly, a hackfest is an excellent opportunity to recruit prospective new colleagues in an intense and enjoyable training experience.
Meals and Accommodations

We want these events to be highlights for people on the projects, so it is important that the meals and accommodations be the best that we can reasonably afford. Eating, snacking, and coffee breaks provide time for discussions and developing social capital, both of which are significant for the success of a research project. We have a senior colleague in computing science, not related to our group, who holds hackfests at his cottage, where the graduate students take turns doing the cooking. We would hesitate to recommend this approach, since there would be lost time for the cook that could be more productively used on the project. At our most recent event, we hired a caterer. This strategy also worked well because it removed the added stress of finding, three times a day, a good place to feed everyone. In terms of sleeping accommodations, we have tended to provide one room per participant, since the additional stress of living with a roommate seems an unnecessary potential burden to introduce; private rooms can also be a space to individually unwind away from what may be an unusually concentrated period of time with the same group of colleagues.

Work Rooms

The situated environment of work is very important. We have worked in the common areas of cluster housing, in the lounges of hotels, in boardrooms, in university labs, and on one occasion in a rustic lodge. We hesitate to say it, because there were problems with the internet connectivity and task assignment, but our four days in the rustic lodge were probably the most successful so far. We were in rural Alberta, but there was a nearby pond, a place for ultimate Frisbee games, and there were fire pits, hot tubs, and wood chopping that we could amuse ourselves with during breaks. There is a value in having varied forms of entertainment and leisure, which can provide mental breaks. We prefer if possible to have a mix of physical activities and pure entertainment (music, movies), although it is important to be sensitive to different tastes.

Pair Programming

Even when there are a dozen participants, it is normally the case that we end up working in smaller groups of 2, 3, or 4. The dynamics of the group and the task can indicate the appropriate group size. We have also recently tried pair programming, where two programmers both work on the same computer. It seems to be a useful approach, and we intend to do more of it. However, the type of activity that can be paired up seems to depend on the nature of the participants’ discipline and typical work approach. For example, we have found that two designers work well together on concept planning and paper prototypes, but they need divided tasks, and two computers, when it comes to creating a digital sketch.

Wrapping-up

We think that if participants feel, at the end of the hackfest, that the project has moved forward and their time was well spent, they tend to continue progressing on the project. We have tried a few different methods to create that sense of conclusion and, therefore, satisfaction. At Kramer pond, for example, we concluded the hackfest with a show-and-tell of tasks accomplished during the event, and an agreement about the next steps participants would take upon returning home. We have also tried to end a hackfest with a conference paper submission.
This strategy has had mixed results – about half of the intended papers managed to reach the submission stage.

Documentation

Increasingly, we have focused a portion of our hackfest energy on documenting the event. We have used image posting sites, such as flickr.com, wiki posts, and video recording. Documentation serves as a reminder of agreed-upon and accomplished tasks, fosters a sense of satisfaction with the activity, and maintains a record of successes (and failures) emergent from the activity.

Future Directions

Although there are costs associated with hackfesting, we feel that the benefits so far have been tremendous. Projects typically take a quantum leap forward around the momentum that can be gained at a hackfest. We need to learn more about ramping up, and we are also interested in better documentation of the events themselves, perhaps in the form of videography.

References

Ambler, Scott. 2002. “Bridging the Distance.” http://www.ddj.com/article/printableArticle.jhtml?articleID=184414899&dept_url=/architect/

Davis, Hugh, and Karen Fill. 2007. Embedding Blended Learning in a University’s Teaching Culture: Experiences and Reflections. British Journal of Educational Technology, 38/5, pp. 817-828.

Hackey: A Rapid Online Prototyping Game for Designers and Programmers

Stéfan Sinclair and Stan Ruecker

Extended Abstract

Researchers in Humanities Computing are frequently involved in the production of new online software systems. The specific roles vary: some researchers work primarily as humanists interested in building a tool in order to carry out their own next project, while others may be archivists creating a digital collection as a public resource, programmers developing prototype systems for testing a concept, or designers experimenting with new approaches to interfaces or visualizations. The semi-serious slogan of the Text Analysis Developer’s Alliance (TADA) summarizes the activity in this way: “Real humanists make tools.”

In the case of the design of the tools, it is useful to consider three levels, which in some contexts we have referred to as basic, advanced, and experimental. A basic tool is one that will run without difficulty in every contemporary browser. It provides functionality that everyone expects. A search function is a good example. In a basic tool, it is usual to follow the current best practices for online design, with due attention to the standard heuristics that can help ensure that the result is usable. Nielsen (2000) provides a list of ten such criteria: (1) Use simple and natural dialogue. (2) Speak the users’ language. (3) Minimize user memory load. (4) Consistency. (5) Feedback. (6) Clearly marked exits. (7) Shortcuts. (8) Good error messages. (9) Prevent errors. (10) Help and documentation.

An advanced tool is a basic tool with more features. For example, an advanced search function might provide support for Boolean operators, or allow nested queries, or include proximity searches, where a search is successful if the target words occur within a specified number of words of each other. Advanced tools may have an experimental dimension, but they are still largely recognizable as something a typical user can be expected to have seen before. They are still within the realm of contemporary best practices.
In the case of experimental tools, however, the goal is not primarily to implement the current best practices, but instead to experiment with new visual forms for information display and online interaction. Often the prototype will serve as the basis for user testing involving a new opportunity for action. Examples might include Shneiderman et al.’s (1992) concept of direct manipulation, where the data and the interface are so tightly coupled that the visualization is the interface; Bederson’s (2001) approach to zoomable user interfaces (ZUIs), where levels of magnification become central to interaction; or Harris and Kamvar’s (2007) constellation approach to emotional expression in the blogosphere, where the results of blog scrapes are plotted as patterns that dynamically emerge from a constantly changing visual field. As Lev Manovich put it during a question and answer period in a paper session at Digital Humanities 2007, “in this line of research, a prototype is itself a theory.”

The experimental design of interfaces and visualization tools is a special case for researchers in humanities computing, because it involves the extension of concepts into areas that are not well defined. In our research projects, we typically have a content expert, programmer, and visual communication designer working closely together in a cycle of static sketching, kinetic sketching, prototyping, and refinement, where each iteration may require as much as several months or as little as a few hours. One of our ongoing interests is in finding ways to facilitate the process. In this paper, we describe our initial experiments in creating an online system that encourages the research team to focus on details taken one at a time, in order to support rapid turnaround in design and development. Hackey provides the designer and programmer with a means of focusing on individual issues in the process, and rapidly exchanging ideas, sketches, and prototypes. The game also includes a third role, in the form of the “pit boss”, who lurks on the conversation and can step in to provide guidance or suggestions. Either player can also choose during a turn to address a question to the pit boss, usually in the form of clarification or appeal for arbitration of some kind.

In addition to encouraging rapid turnaround, Hackey is also intended to encourage fun. By treating the exchanges as a kind of collaborative game, there is room for the players to contextualize their activity as more than a task list or a series of Jira tickets. Hackey provides a framework for discussions that can be at the same time serious and lighthearted. The principle is that the bounds of discourse are opened slightly by the game metaphor, to allow for mutual interrogation of the underlying assumptions held by both the designer and the programmer. This framework addresses an ongoing difficulty with interdisciplinary collaboration, where experts from two fields need to be able to find common ground that can accommodate the expertise of both, without undue compromise of the disciplinary expertise of either party. What we would like to avoid is the situation where the programmers feel that they are “working for” the designers, or the designers feel that they are not considered important partners in the process by the programmers, and that their proposals can be rejected out of hand without discussion.
In this respect, the third-party “pit boss” role is useful, since in extreme situations it can serve as a court of second appeal for either of the players, and under normal conditions it constitutes a presence that helps maintain the spirit of free and open exchange of ideas.

Future Directions

The Hackey prototype is still in its early stages, and we are still in the phase of experimentation, but we hope to make it more widely available for interdisciplinary teams who are seeking new methods for collaborative work. By logging their dialogues in the system and treating these records as an object of interpretive study, we hope to be able to elaborate the original concept with additional features appropriate to the general goals.

References

Bederson, B. 2001. PhotoMesa: A zoomable image browser using quantum treemaps and bubblemaps. Proceedings of the 14th annual ACM symposium on user interface software and technology. 71–80. http://doi.acm.org/10.1145/502348.502359.

Harris, J. and S. Kamvar. 2007. We feel fine. http://wefeelfine.org.

MONK. 2007. Metadata Offer New Knowledge. www.monkproject.org.

Nielsen, J. 2000. Designing web usability: The practice of simplicity. Indianapolis, IN: New Riders.

Shneiderman, B., Williamson, C. and Ahlberg, C. (1992). Dynamic Queries: Database Searching by Direct Manipulation. Proceedings of the SIGCHI conference on human factors in computing systems. pp. 669-670. http://doi.acm.org/10.1145/142750.143082. Accessed March 10, 2006.

Rules of the Order: The Sociology of Large, Multi-Institutional Software Development Projects

Stephen Ramsay

Digital Humanities is increasingly turning toward tool development, both as a research modality and as a practical means of exploiting the vast collections of digital resources that have been amassing over the last twenty years. Like the creation of digital collections, software development is an inherently collaborative activity, but unlike digital collections, complex software systems do not easily allow for loosely aggregated collaborations. With digital text collections, it might be possible to have several contributors working at their own pace—even at several institutions—with a high level of turnover and varying local workflows, but this is rarely possible with software development. However discretized the components of a system might be, the final product is invariably a tightly integrated whole with numerous dependencies and a rigidly unified architectural logic.

For most digital humanities practitioners, amassing a team of developers almost requires that work be distributed across institutions and among a varied group of people. Any nontrivial application requires experts from a number of different development subspecialties, including such areas as interface design, relational database management, programming, software engineering, and server administration (to name only a few). Few research centers in DH have the staff necessary for undertaking large application development projects, and even the ones that do quickly find that cross-departmental collaborations are needed to assemble the necessary expertise.

There are a great many works on and studies of project management for software development, but very few of these resources address the particular situation of a widely distributed group of developers.
Most of the popular methodologies (Rational Unified Process, Theory of Constraints, Event Chain) assume a group of developers who are all in one place with a clear management structure. More recent “agile methods” (such as Extreme Project Management) invariably stipulate a small group of developers. Reading the ever useful (if slightly dispiriting) works on project failure, such as Brooks’s famous 1975 book The Mythical Man-Month, one wonders whether project success is even possible for a distributed group. Development of open source software might seem an instructive counter-example of project success arising from massive, distributed teams of developers, but as several studies have shown, the core development teams for even the largest software development projects (Apache, Mozilla, and the Linux kernel, for example) are considerably smaller than the hundreds of people listed as contributors might suggest, and in most cases, the project was able to build upon a core code base that was originally developed in a single location with only a few developers.

The MONK Project, which endeavors to build a portable system for undertaking text analysis and visualization with large, full-text literary archives, might seem a perfect storm of management obstacles. The project has two primary investigators, half a dozen co-investigators, nearly forty project participants at eight different institutions, and includes professors of all ranks, staff programmers, research fellows, and graduate students. The intended system, moreover, involves dozens of complex components requiring expertise in a number of programming languages, architectural principles, and data storage methods, and facility with a broad range of text analytical techniques (including text mining and statistical natural language processing).

This paper offers some reflections on the trials and triumphs of project management at this scale, using the MONK Project as an example. It focuses particularly on the sociology of academia as a key element to be acknowledged in management practice, and concludes that honoring existing institutional boundaries and work patterns is essential to maintaining a cohesive (if, finally, hierarchical) management structure. It also considers the ways in which this apparently simple notion runs counter to the cultural practices and decision making methods of ordinary academic units—practices which are so ingrained as to seem almost intuitive to many. The MONK Project has from the beginning studiously avoided references to cloisters and habits in its description of itself. This paper offers a playful analogy in contrast to that tradition, by suggesting that most of what one needs to know about project management was set forth by Benedict of Nursia in his sixth-century Rule for monks. Particular emphasis is placed on Benedict’s belief that social formations become increasingly fractured as they move away from the cenobitic ideal (which, I suggest, is roughly analogous to the structure of the typical research center), and his belief that the Abbot (like the modern project manager) must assert a kind of benevolent despotism over those within his charge, even if the dominant values of the group privilege other, more democratic forms of leadership.

References

Benedict of Nursia. Regula. Collegeville, Minn.: Liturgical Press, 1981.

Brooks, Frederick. The Mythical Man-Month: Essays on Software Engineering. Reading, MA: Addison-Wesley, 1975.

DeCarlo, Douglas. Extreme Project Management. New York: Wiley-Jossey, 2004.
Kruchten, Philippe. The Rational Unified Process: An Introduction. 3rd ed. Reading, Mass: Addison-Wesley, 2003. Leach, Lawrence P. Critical Chain Project Management. 2nd ed. Norwood, MA.: Artech, 2004. _____________________________________________________________________________ 20 Digital Humanities 2008 _____________________________________________________________________________ Mockus, Audris. Roy T. Fielding, James D. Herbsleb. “Two Case Studies of Open Source Software Development: Apache and Mozilla.” ACM Transactions on Software Engineering and Methodology 11.3 (2002): 309-46. Skinner, David C. Introduction to Decision Analysis. 2nd ed. Sugar Land, TX: Probabilistic, 1999. Aspects of Sustainability in Digital Humanities Georg Rehm georg.rehm@uni-tuebingen.de Tübingen University, Germany Andreas Witt andreas.witt@uni-tuebingen.de Tübingen University, Germany The three papers of the proposed session, “Aspects of Sustainability in Digital Humanities”, examine the increasingly important topic of sustainability from the point of view of three different fields of research: library and information science, cultural heritage management, and linguistics. Practically all disciplines in science and the humanities are nowadays confronted with the task of providing data collections that have a very high degree of sustainability. This task is not only concerned with the long-term archiving of digital resources and data collections, but also with aspects such as, for example, interoperability of resources and applications, data access, legal issues, field-specific theoretical approaches, and even political interests. The proposed session has two primary goals. Each of the three papers will present the most crucial problems that are relevant for the task of providing sustainability within the given field or discipline. In addition, each paper will talk about the types of digital resources and data collections that are in use within the respective field (for example, annotated corpora and syntactic treebanks in the field of linguistics). The main focus, however, lies in working on the distinction between field-specific and universal aspects of sustainability so that the three fields that will be examined – library and information science, cultural heritage management, linguistics – can be considered case studies in order to come up with a more universal and all-encompassing angle on sustainability. Especially for introductory texts and field – independent best-practice guidelines on sustainability it is extremely important to have a solid distinction between universal and field-specific aspects. The same holds true for the integration of sustainability-related informational units into field-independent markup languages that have a very broad scope of potential applications, such as the TEI guidelines published by the Text Encoding Initiative. Following are short descriptions of the three papers: The paper “Sustainability in Cultural Heritage Management” by Øyvind Eide, Christian-Emil Ore, and Jon Holmen discusses technical and organisational aspects of sustainability with regard to cultural heritage information curated by institutions such as, for example, museums. Achieving organisational sustainability is a task that not only applies to the staff of a museum but also to education and research institutions, as well as to national and international bodies responsible for our common heritage. 
_____________________________________________________________________________ 21 Digital Humanities 2008 _____________________________________________________________________________ Vital to the sustainability of collections is information about the collections themselves, as well as individual items in those collections. “Sustaining Collection Value: Managing Collection/ Item Metadata Relationships”, by Allen H. Renear, Richard Urban, Karen Wickett, Carole L. Palmer, and David Dubin, examines the difficult problem of managing collection level metadata in order to ensure that the context of the items in a collection is accessible for research and scholarship. They report on ongoing research and also have preliminary suggestions for practitioners. The final paper “Sustainability of Annotated Resources in Linguistics”, by Georg Rehm, Andreas Witt, Erhard Hinrichs, and Marga Reis, provides an overview of important aspects of sustainability with regard to linguistic resources. The authors demonstrate which of these several aspects can be considered specific for the field of linguistics and which are more general. Paper 1: Sustainability in Cultural Heritage Management Øyvind Eide, Christian-Emil Ore, Jon Holmen University of Oslo, Norway Introduction During the last decades, a large amount of information in cultural heritage institutions have been digitised, creating the basis for many different usage scenarios.We have been working in this area for the last 15 years, through projects such as the Museum Project in Norway (Holmen et al., 2004). We have developed routines, standardised methods and software for digitisation, collection management, research and education. In this paper, we will discuss long term sustainability of digital cultural heritage information. We will discuss the creation of sustainable digital collections, as well as some problems we have experienced in this process. We have divided the description of sustainability in three parts. First we will describe briefly the technical part of sustainability work (section 2). After all, this is a well known research area on its own, and solutions to many of the problems at hand are known, although they may be hard to implement. We will then use the main part of the paper to discuss what we call organisational sustainability (section 3), which may be even more important than the technical part in the future — in our opinion, it may also be more difficult to solve. Finally, we briefly address the scholarly part of sustainability (section 4). Technical Sustainability Technical sustainability is divided in two parts: preservation of the digital bit patterns and the ability to interpret the bit pattern according to the original intention. This is an area where important work is being done by international bodies such as UNESCO (2003), as well as national organisations such as the Library of Congress in the USA (2007), and the Digital Preservation Coalition in the UK (DPC,2001). It is evident that the use of open, transparent formats make sit easier to use preserved digital content. In this respect XML encoding is better compared to proprietary word processor formats, and uncompressed TIFF is more transparent than company-developed compressed image formats. In a museum context, though, there is often a need to store advanced reproductions of objects and sites, and there have been problems to find open formats able to represent, in full, content exported from proprietary software packages. 
An example of this is CAD systems, where the open format SVG does not have the same expressive power as the proprietary DXF format (Westcott, 2005, p.6). It is generally a problem for applications using new formats, especially when they are heavily dependent upon presentation. Organisational Sustainability Although there is no sustainability without the technical part, described above, taken care of, the technical part alone is not enough. The organisation of the institution responsible for the information also has to be taken into consideration. In an information system for memory institutions it is important to store the history of the information. Digital versions of analogue sources should be stored as accurate replica, with new information linked to this set of historical data so that one always has access to up-to-date versions of the information, as well as to historical stages in the development of the information (Holmenetal.,2004, p.223). To actually store the files, the most important necessity is large, stable organisations taking responsibility. If the responsible institution is closed without a correct transfer of custody for the digital material, it can be lost easily. An example of this is the Newham Archive (Dunning, 2001) incident. When the Newham Museum Archaeological Service was closed down, only a quick and responsible reaction of the sacked staff saved the result of ten years of work in the form of a data dump on floppies. Even when the data files are kept, lack of necessary metadata may render them hard to interpret. In the Newham case the data were physically saved but a lot of work was needed to read the old data formats, and some of the information was not recoverable. Similar situations may even occur in large, stable organisations. The Bryggen Museum in Bergen, Norway, is a part of the University Museum in Bergen and documents the large excavation of the medieval town at the harbour in Bergen which took place from the 1950s to the 1970s. The museum stored the information in a large database. Eventually the system became obsolete and the database files were stored at the University. But there were no explicit routines for the packing and future unpacking of such digital information. Later, when the files were imported into a new system, parts of the original information were not recovered. Fortunately all the excavation documentation was originally done on paper so in principle no information was lost. _____________________________________________________________________________ 22 Digital Humanities 2008 _____________________________________________________________________________ Such incidents are not uncommon in the museum world. A general problem, present in both examples above, is the lack of metadata.The scope of each database table and column is well known when a system is developed, but if it is not documented, such meta-information is lost. In all sectors there is a movement away from paper to born digital information. When born digital data based on archaeological excavations is messed up or lost – and we are afraid this will happen – then parts of our cultural heritage are lost forever. An archaeological excavation destroys its own sources and an excavation cannot be repeated. For many current excavation projects a loss of data like the Bryggen Museum incident would have been a real catastrophe. 
The Newham example demonstrates weak planning for negative external effects on information sustainability, whereas the Bergen example shows how a lack of proper organisational responsibility for digital information may result in severe information loss. It is our impression that in many memory institutions there is too little competence on how to introduce information technologies in an organisation to secure both interchange of information between different parts of the organisation and long-term sustainability of the digital information. A general lack of strategies for long term preservation is documented in a recent Norwegian report (Gausdal, 2006, p.23). When plans are made in order to introduce new technology and information systems into an organisation one has to adapt the system to the organisation or the organisation to the system. This is often neglected and the information systems are not integrated in the everyday work of the staff. Thus, the best way to success is to do this in collaboration and understanding with the employees. This was pointed out by Professor Kristen Nygaard. In a paper published in 1992 describing the uptake of Simula I from 1965 onwards, Nygaard states: “It was evident that the Simula-based analyses were going to have a strong influence on the working conditions of the employees: job content, work intensity and rhythm, social cooperation patterns were typical examples” (Nygaard, 1992, p. 53). Nygaard focused on the situation in the ship building industry, which may be somewhat distant from the memory institutions. Mutate mutandis, the human mechanisms are the same. There is always a risk of persons in the organisation sabotaging or neglecting new systems. Scholarly Sustainability When a research project is finished, many researchers see the report or articles produced as the only output, and are confident that the library will take care of their preservation. But research in the humanities and beyond are often based on material collected by the researcher, such as ethnographic objects, sound recordings, images, and notes. The scholarly conclusions are then based on such sources. To sustain links from sources to testable conclusions, they have to be stored so that they are accessible to future researchers. But even in museums, this is often done only in a partial manner. Objects may find their way into the collections. But images, recordings and notes are often seen as the researcher’s private property and responsibility, and may be lost when her career is ended. Examples of this are hard to document, though, because such decisions are not generally made public. Conclusion Sustainability of data in the cultural heritage sector is, as we have seen, not just a technical challenge. The sustainability is eased by the use of open and transparent standards. It is necessary to ensure the existence of well funded permanent organisation like national archives and libraries. Datasets from museums are often not considered to lie within the preservation scope of the existing organisations. Either this has to be changed or the large museums have to act as digital archives of museum data in general. However, the most important measure to ensure sustainability is to increase the awareness of the challenge among curators and scholars. If not, large amounts of irreplaceable research documentation will continue to be lost. References DPC(2001): “Digital preservation coalition”. http://www. dpconline.org/graphics. 
Dunning, Alastair(2001): “Excavating data: Retrieving the Newham archive”. http://ahds.ac.uk/creating/case-studies/ newham/. Gausdal, Ranveig Låg (editor) (2006): Cultural heritage for all — on digitisation, digital preservation and digital dissemination in the archive, library and museum sector. A report by the Working Group on Digitisation, the Norwegian Digital Library. ABM-Utvikling. Holmen, Jon; Ore, Christian-Emil and Eide,Øyvind(2004): “Documenting two histories at once: Digging into archaeology”. In: Enter the Past. The E-way into the Four Dimensions of Cultural Heritage. BAR, BAR International Series 1227, pp. 221-224 LC(2007): “The library of congress. Digital preservation. http://www.digitalpreservation.gov. Nygaard, Kristen(1992): “How many choices do we make? How many are difficult? In: Software Development and Reality Construction, edited by Floyd C., Züllighoven H., Budde R., and R., Keil-Slawik. Springer-Verlag, Berlin, pp. 52-59. UNESCO (2003): “Charter on the preservation of digital heritage. Adopted at the 32nd session of the 9/10 general conference of UNESCO” Technical Report, UNESCO. http:// portal.unesco.org/ci/en/files/13367/10700115911Charter_ en.pdf/Charter_en.pdf. Westcott, Keith(2005): Preservation Handbook. Computer Aided Design (CAD). Arts and Humanities Data Service. http:// ahds.ac.uk/preservation/cad-preservation-handbook.pdf. _____________________________________________________________________________ 23 Digital Humanities 2008 _____________________________________________________________________________ Paper 2: Sustaining Collection Value: Managing Collection/Item Metadata Relationships Allen H. Renear, Richard Urban, Karen Wickett, Carole L. Palmer, David Dubin University of Illinois at Urbana-Champaign, USA Introduction Collections of texts, images, artefacts, and other cultural objects are usually designed to support particular research and scholarly activities.Toward that end collections themselves, as well as the items in the collections, are carefully developed and described. These descriptions indicate such things as the purpose of the collection, its subject, the method of selection, size, nature of contents, coverage, completeness, representativeness, and a wide range of summary characteristics, such as statistical features. This information enables collections to function not just as aggregates of individual data items but as independent entities that are in some sense more than the sum of their parts, as intended by their creators and curators [Currall et al., 2004,Heaney, 2000,Lagoze et al., 2006, Palmer, 2004]. Collection-level metadata, which represents this information in computer processable form, is thus critical to the distinctive intellectual and cultural role of collections as something more than a set of individual objects. Unfortunately, collection-level metadata is often unavailable or ignored by retrieval and browsing systems, with a corresponding loss in the ability of users to find, understand, and use items in collections [Lee, 2000, Lee, 2003, Lee, 2005, Wendler, 2004]. Preventing this loss of information is particularly difficult, and particularly important, for “metasearch”, where itemlevel descriptions are retrieved from a number of different collections simultaneously, as is the case in the increasingly distributed search environment [Chistenson and Tennant, 2005, Dempsey, 2005, Digital Library Federation, 2005, Foulonneau et al., 2005, Lagoze et al., 2006, Warner et al., 2007]. 
The now familiar example of this challenge is the “‘on a horse’ problem”, where a collection with the collection-level subject “Theodore Roosevelt” has a photograph with the item-level annotation “on a horse” [Wendler, 2004] Item-level access across multiple collections (as is provided not only by popular Internet search engines, but also specialized federating systems, such as OAI portals) will not allow the user to effectively use a query with keywords “Roosevelt” and “horse” to find this item, or, if the item is retrieved using item-level metadata alone, to use collection-level information to identify the person on the horse as Roosevelt. The problem is more complicated and consequential than the example suggests and the lack of a systematic understanding of the nature of the logical relationships between collectionlevel metadata and item-level metadata is an obstacle to the development of remedies. This understanding is what is required not only to guide the development of contextaware search and exploitation, but to support management and curation policies as well. The problem is also timely: even as recent research continues to confirm the key role that collection context plays in the scholarly use of information resources [Brockman et al., 2001, Palmer, 2004], the Internet has made the context-free searching of multiple collections routine. In what follows we describe our plans to develop a framework for classifying and formalizing collection-level/item-level metadata relationships. This undertaking is part of a larger project, recently funded by US Institute for Museum and Library Services (IMLS), to develop tools for improved retrieval and exploitation across multiple collections.1 Varieties of Collection/Item Metadata Relationships In some cases the relationship between collection-level metadata and item-level metadata attributes appears similar to non-defeasible inheritance. For instance, consider the Dublin Core Collections Application Profile element marcrel: OWN, adapted from the MARC cataloging record standard. It is plausible that within many legal and institutional contexts whoever owns a collection owns each of the items in the collection, and so if a collection has a value for the marcrel: OWN attribute then each member of the collection will have the same value for marcrel:OWN. (For the purpose of our example it doesn’t matter whether or not this is actually true of marcrel:OWN, only that some attributes are sometimes used by metadata librarians with an understanding of this sort, while others, such as dc:identifier, are not). In other cases the collection-level/item-level metadata relationship is almost but not quite this simple. Consider the collection-level attribute myCollection:itemType, intended to characterize the type of objects in a collection, with values such as “image,” “text,” “software,” etc. (we assume heterogeneous collections).2 Unlike the preceding case we cannot conclude that if a collection has the value “image” for myCollection: itemType then the items in that collection also have the value “image” for that same attribute. This is because an item which is an image is not itself a collection of images and therefore cannot have a non-null value for myCollection:itemType. 
However, while the rule for propagating the information represented by myCollection:itemType from collections to items is not simple propagation of attribute and value, it is nevertheless simple enough: if a collection has a value, say "image," for myCollection:itemType, then the items in the collection have the same value, "image," for a corresponding attribute, say, myItem:type, which indicates the type of item (cf. the Dublin Core metadata element dc:type). The attribute myItem:type has the same domain of values as myCollection:itemType, but a different semantics.

_____________________________________________________________________________ 24 Digital Humanities 2008 _____________________________________________________________________________

When two metadata attributes are related as myCollection:itemType and myItem:type we might say the first can be v-converted to the other. Roughly: a collection-level attribute A v-converts to an item-level attribute B if and only if whenever a collection has the value z for A, every item in the collection has the value z for B. This is the simplest sort of convertibility — the attribute changes, but the value remains the same. Other sorts of conversion will be more complex. We note that the sort of propagation exemplified by marcrel:OWN is a special case of v-convertibility: marcrel:OWN v-converts to itself.
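To make the propagation rule concrete, the following Python fragment is a minimal sketch of v-conversion, not part of the DCC/CIMR software: the V_CONVERSIONS table, the record structures, and the propagate function are hypothetical, and only the attribute names marcrel:OWN, myCollection:itemType, and myItem:type are taken from the discussion above (the sample values are invented).

# Minimal sketch of v-conversion: propagating collection-level metadata
# to item-level records. Hypothetical data structures, not the DCC registry API.

# Conversion table: collection-level attribute -> item-level attribute.
# marcrel:OWN v-converts to itself; myCollection:itemType v-converts to myItem:type.
V_CONVERSIONS = {
    "marcrel:OWN": "marcrel:OWN",
    "myCollection:itemType": "myItem:type",
}

# Illustrative records only.
collection = {
    "dc:title": "Theodore Roosevelt Photographs",
    "marcrel:OWN": "Example Memorial Library",
    "myCollection:itemType": "image",
}
items = [
    {"dc:description": "on a horse"},
    {"dc:description": "delivering a speech"},
]

def propagate(collection_md, item_md):
    """Copy v-convertible collection-level values onto an item record,
    without overwriting values the item already carries."""
    enriched = dict(item_md)
    for coll_attr, item_attr in V_CONVERSIONS.items():
        if coll_attr in collection_md and item_attr not in enriched:
            enriched[item_attr] = collection_md[coll_attr]
    return enriched

enriched_items = [propagate(collection, it) for it in items]
# The first item now also carries "myItem:type": "image" and the owner value,
# so an item-focused search across many collections sees the inherited context.

Non-convertible attributes (a collection's selection method, intended purpose, or summary statistics) deliberately have no entry in such a table; as the discussion below notes, they have to be surfaced to users by other means.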
This analysis suggests a number of broader issues for collection curators. Obviously the conversion of collection-level metadata to item-level metadata, when possible, can improve discovery and exploitation, especially in item-focused searching across multiple collections. But can we even in the simplest case be confident of conversion without loss of information? For example, it may be that in some cases an "image" value for myCollection:itemType conveys more information than the simple fact that each item in the collection has the "image" value for myItem:type. Moreover, there are important collection-level attributes that both (i) resist any conversion and (ii) clearly result in loss of important information if discarded. Intriguingly, these attributes turn out to be carrying information that is very tightly tied to the distinctive role the collection is intended to play in the support of research and scholarship. Obvious examples are metadata indicating that a collection was developed according to some particular method, designed for some particular purpose, used in some way by some person or persons in the past, representative (in some respect) of a domain, had certain summary statistical features, and so on. This is precisely the kind of information that makes a collection valuable to researchers, and if it is lost or inaccessible, the collection cannot be useful, as a collection, in the way originally intended by its creators.

The DCC/CIMR Project

These issues were initially raised during an IMLS Digital Collections and Content (DCC) project, begun at the University of Illinois at Urbana-Champaign in 2003. That project developed a collection-level metadata schema [3] based on the RSLP [4] and Dublin Core Metadata Initiative (DCMI) and created a collection registry for all the digital collections funded through the Institute of Museum and Library Services National Leadership Grant (NLG) since 1998, with some Library Services and Technology Act (LSTA) funded collections included since 2006 [5]. The registry currently contains records for 202 collections. An item-level metadata repository was also developed, which so far has harvested 76 collections using the OAI-PMH protocol [6]. Our research initially focused on overcoming the technical challenges of aggregating large heterogeneous collections of item-level records and gathering collection descriptions from contributors. We conducted studies on how content contributors conceived of the roles of collection descriptions in digital environments [Palmer and Knutson, 2004, Palmer et al., 2006], and conducted preliminary usability work. These studies, and related work on the CIC Metadata Portal [7], suggest that while the boundaries around digital collections are often blurry, many features of collections are important for helping users navigate and exploit large federated repositories, and that collection- and item-level descriptions should work in concert to benefit certain kinds of user queries [Foulonneau et al., 2005].

In 2007 we received a new three-year IMLS grant to continue the development of the registry and to explore how a formal description of collection-level/item-level metadata relationships could help registry users locate and use digital items. This latter activity, CIMR (Collection/Item Metadata Relationships), consists of three overlapping phases. The first phase is developing a logic-based framework of collection-level/item-level metadata relationships that classifies metadata into varieties of convertibility, with associated rules for propagating information between collection and item levels and supporting further inferencing. Next we will conduct empirical studies to see if our conjectured taxonomy matches the understanding and behavior of metadata librarians, metadata specification designers, and registry users. Finally we will design and implement pilot applications using the relationship rules to support searching, browsing, and navigation of the DCC Registry. These applications will include non-convertible and convertible collection-level/item-level metadata relationships. One outcome of this project will be a proposed specification for a metadata classification code that will allow metadata specification designers to indicate the collection-level/item-level metadata relationships intended by their specification. Such a specification will in turn guide metadata librarians in assigning metadata, and metadata systems designers in designing systems that can mobilize collection-level metadata to provide improved searching, browsing, understanding, and use by end users. We will also draft and make electronically available RDF/OWL bindings for the relationship categories and inference rules.

Preliminary Guidance for Practitioners

A large part of the problem of sustainability is ensuring that information will be valuable, and as valuable as possible, to multiple audiences, for multiple purposes, via multiple tools, and over time. Although we have only just begun this project, already some preliminary general recommendations can be made to the different stakeholders in collection management. Note that tasks such as propagation must be repeated not only as new objects are added or removed, but also as new information about objects and collections becomes available.

_____________________________________________________________________________ 25 Digital Humanities 2008 _____________________________________________________________________________

For metadata standards developers:

1. Metadata standards should explicitly document the relationships between collection-level metadata and item-level metadata.
Currently we have neither the understanding nor the formal mechanisms for such documentation, but they should be available soon.

For systems designers:

2. Information in convertible collection-level metadata should be propagated to items in order to make contextual information fully available to users, especially users working across multiple collections. (This is not a recommendation for how to manage information internally, but for how to represent it to the user; relational tables may remain in normal forms.)

3. Information in item-level metadata should, where appropriate, be propagated to collection-level metadata.

4. Information in non-convertible collection-level metadata must, to the fullest extent possible, be made evident and available to users.

For collection managers:

5. Information in non-convertible metadata must be a focus of data curation activities if collections are to retain and improve their usefulness over time.

When formal specifications and tools based on them are in place, relationships between metadata at the collection and item levels will be integrated more directly into management and use. In the meantime, attention and sensitivity to the issues we raise here can still improve matters through documentation and policies, and by informing system design.

Acknowledgments

This research is supported by a 2007 IMLS NLG Research & Demonstration grant hosted by the GSLIS Center for Informatics Research in Science and Scholarship (CIRSS). Project documentation is available at http://imlsdcc.grainger.uiuc.edu/about.asp#documentation. We have benefited considerably from discussions with other DCC/CIMR project members and with participants in the IMLS DCC Metadata Roundtable, including: Timothy W. Cole, Thomas Dousa, Dave Dubin, Myung-Ja Han, Amy Jackson, Mark Newton, Carole L. Palmer, Sarah L. Shreeves, Michael Twidale, Oksana Zavalina.

References

[Brockman et al., 2001] Brockman, W., Neumann, L., Palmer, C. L., and Tidline, T. J. (2001). Scholarly Work in the Humanities and the Evolving Information Environment. Digital Library Federation/Council on Library and Information Resources, Washington, D.C.

[Chistenson and Tennant, 2005] Chistenson, H. and Tennant, R. (2005). Integrating information resources: Principles, technologies, and approaches. Technical report, California Digital Library.

[Currall et al., 2004] Currall, J., Moss, M., and Stuart, S. (2004). What is a collection? Archivaria, 58:131–146.

[Dempsey, 2005] Dempsey, L. (2005). From metasearch to distributed information environments. Lorcan Dempsey's weblog. Published on the World Wide Web at http://orweblog.oclc.org/archives/000827.html.

[Digital Library Federation, 2005] Digital Library Federation (2005). The distributed library: OAI for digital library aggregation: OAI scholars advisory panel meeting, June 20–21, Washington, D.C. Published on the World Wide Web at http://www.diglib.org/architectures/oai/imls2004/OAISAP05.htm.

[Foulonneau et al., 2005] Foulonneau, M., Cole, T., Habing, T. G., and Shreeves, S. L. (2005). Using collection descriptions to enhance an aggregation of harvested item-level metadata. In ACM, editor, Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 32–41, New York. ACM/IEEE-CS, ACM Press.

[Heaney, 2000] Heaney, M. (2000). An analytical model of collections and their catalogues. Technical report, UK Office for Library and Information Science.

[Lagoze et al., 2006] Lagoze, C., Krafft, D., Cornwell, T., Dushay, N., Eckstrom, D., and Saylor, J. (2006).
Metadata aggregation and automated digital libraries: A retrospective on the NSDL experience. In Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries, pages 230–239, New York. ACM/IEEE-CS, ACM Press.

[Lee, 2000] Lee, H. (2000). What is a collection? Journal of the American Society for Information Science, 51(12):1106–1113.

[Lee, 2003] Lee, H. (2003). Information spaces and collections: Implications for organization. Library & Information Science Research, 25(4):419–436.

[Lee, 2005] Lee, H. (2005). The concept of collection from the user's perspective. Library Quarterly, 75(1):67–85.

[Palmer, 2004] Palmer, C. (2004). Thematic research collections, pages 348–365. Blackwell, Oxford.

[Palmer and Knutson, 2004] Palmer, C. L. and Knutson, E. (2004). Metadata practices and implications for federated collections. In Proceedings of the 67th ASIS&T Annual Meeting (Providence, RI, Nov. 12–17, 2004), volume 41, pages 456–462.

[Palmer et al., 2006] Palmer, C. L., Knutson, E., Twidale, M., and Zavalina, O. (2006). Collection definition in federated digital resource development. In Proceedings of the 69th ASIS&T Annual Meeting (Austin, TX, Nov. 3–8, 2006), volume 43.

_____________________________________________________________________________ 26 Digital Humanities 2008 _____________________________________________________________________________

[Warner et al., 2007] Warner, S., Bakaert, J., Lagoze, C., Lin, X., Payette, S., and Van de Sompel, H. (2007). Pathways: Augmenting interoperability across scholarly repositories. International Journal on Digital Libraries, 7(1-2):35–52.

[Wendler, 2004] Wendler, R. (2004). The eye of the beholder: Challenges of image description and access at Harvard, pages 51–56. American Library Association, Chicago.

Notes

1 Information about the IMLS Digital Collections and Content project can be found at: http://imlsdcc.grainger.uiuc.edu/about.asp.

2 In our examples we will use imaginary metadata attributes. The namespace prefix "myCollection:" indicates collection-level attributes and the prefix "myItem:" indicates item-level attributes. No assumptions should be made about the semantics of these attributes other than what is stipulated for illustration. The current example, myCollection:itemType, does intentionally allude to cld:itemType in the Dublin Core Collections Application Profile, and "image," "text," "software," are from the DCMI Type Vocabulary; but our use of myCollection:itemType differs from cld:itemType in entailing that all of the items in the collection are of the indicated type.

3 General overview and detailed description of the IMLS DCC collection description scheme are available at: http://imlsdcc.grainger.uiuc.edu/CDschema_overview.asp

4 http://www.ukoln.ac.uk/metadata/rslp/

5 http://www.imls.gov/

6 http://www.openarchives.org/OAI/openarchivesprotocol.html

7 http://cicharvest.grainger.uiuc.edu/

Paper 3: Sustainability of Annotated Resources in Linguistics

Georg Rehm, Andreas Witt, Erhard Hinrichs, Marga Reis

Introduction

In practically all scientific fields the task of ensuring the sustainability of resources, data collections, personal research journals, and databases is an increasingly important topic – linguistics is no exception (Dipper et al., 2006, Trilsbeek and Wittenburg, 2006). We report on ongoing work in a project that is concerned with providing methods, tools, best-practice guidelines, and solutions for sustainable linguistic resources. Our overall goal is to make sure that a large and very heterogeneous set of ca. 65 linguistic resources will be accessible, readable, and processible by interested parties such as, for example, other researchers than the ones who originally created said resources, in five, ten, or even 20 years' time. In other words, the agency that funded both our project as well as the projects that created the linguistic resources – the German Research Foundation – would like to avoid a situation in which they have to fund yet another project to (re)create a corpus for whose creation they already provided funding in the past, but the "existing" version is no longer available or readable due to a proprietary file format, because it has been locked away in an academic's hidden vault, or the person who developed the annotation format can no longer be asked questions concerning specific details of the custom-built annotation format (Schmidt et al., 2006).

Linguistic Resources: Aspects of Sustainability

There are several text types that linguists work and interact with on a frequent basis, but the most common, by far, are linguistic corpora (Zinsmeister et al., 2007). In addition to rather simple word and sentence collections, empirical sets of grammaticality judgements, and lexical databases, the linguistic resources our sustainability project is primarily confronted with are linguistic corpora that contain either texts or transcribed speech in several languages; they are annotated using several incompatible annotation schemes. We developed XML-based tools to normalise the existing resources into a common approach of representing linguistic data (Wörner et al., 2006, Witt et al., 2007b) and use interconnected OWL ontologies to represent knowledge about the individual annotation schemes used in the original resources (Rehm et al., 2007a).

Currently, the most central aspects of sustainability for linguistic resources are:

• markup languages
• metadata encoding
• legal aspects (Zimmermann and Lehmberg, 2007, Lehmberg et al., 2007a,b, Rehm et al., 2007b,c, Lehmberg et al., 2008),
• querying and search (Rehm et al., 2007a, 2008a, Söhn et al., 2008), and
• best-practice guidelines (see, for example, the general guidelines mentioned by Bird and Simons, 2003).

None of these points are specific to the field of linguistics; the solutions, however, are. This is exemplified by means of two of these aspects.

The use of markup languages for the annotation of linguistic data has been discussed frequently. This topic is also subject to standardisation efforts: a separate ISO group, ISO TC37 SC4, deals with the standardisation of linguistic annotations. Our project developed an annotation architecture for linguistic corpora. Today, a linguistic corpus is normally represented by a single XML file; the underlying data structures most often found are either trees or unrestricted graphs. In our approach we transform an original XML file into several XML files, so that each file contains the same textual content. The markup of these files is different: each file contains annotations which belong to a single annotation layer. A data structure usable to model the result documents is a multi-rooted tree (Wörner et al., 2006, Witt et al., 2007a,b, Lehmberg and Wörner, 2007).
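As a rough illustration of this splitting step, the sketch below derives one XML document per annotation layer from a toy combined representation. It is only a schematic example under assumed names: the corpus and w elements, the two layers, and the split_layers function are invented for illustration and do not reproduce the project's actual tools or exchange format; the only assumption carried over from the text is that every output file repeats the same textual content under its own layer-specific markup.

# Illustrative sketch only (not the project's software): splitting one
# multi-layer annotation into several XML documents that share identical text.
import xml.etree.ElementTree as ET

# Toy combined representation: tokens carrying annotations from two layers.
tokens = [
    {"text": "Peter",  "pos": "NE",    "syntax": "NP"},
    {"text": "sleeps", "pos": "VVFIN", "syntax": "VP"},
]

# Hypothetical layer definitions: layer name -> token attribute it keeps.
LAYERS = {"pos": "pos", "syntax": "syntax"}

def split_layers(tokens, layers):
    """Return one XML document per annotation layer; the token text is
    repeated identically in every document, only the markup differs."""
    documents = {}
    for layer, attr in layers.items():
        root = ET.Element("corpus", {"layer": layer})
        for tok in tokens:
            w = ET.SubElement(root, "w", {layer: tok[attr]})
            w.text = tok["text"]
        documents[layer] = ET.tostring(root, encoding="unicode")
    return documents

for layer, doc in split_layers(tokens, LAYERS).items():
    print(layer, doc)
# Taken together, the generated documents behave like a multi-rooted tree:
# several annotation roots over one and the same sequence of textual leaves.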
_____________________________________________________________________________ 27 Digital Humanities 2008 _____________________________________________________________________________ The specificities of linguistic data also led to activities in the field of metadata encoding and its standardisation. Within our project we developed an approach to handle the complex nature of linguistic metadata (Rehm et al., 2008b) which is based on the metadata encoding scheme described by the TEI. (Burnard and Bauman, 2007). This method of metadata representation splits the metadata into the 5 different levels the primary information belongs to. These levels are: (1) setting, i.e. the situation in which the speech or dialogue took place; (2) raw data, e.g., a book, a piece of paper, an audio or video recording of a conversation etc.; (3) primary data, e.g., transcribed speech, digital texts etc.; (4) annotations, i.e., (linguistic) markup that add information to primary data; and (5) corpus, i.e. a collection of primary data and its annotations. All of these aspects demonstrate that it is necessary to use field specific as well as generalised methodologies to approach the issue “Sustainability of Linguistic Resources”. References Bird, S. and Simons, G. (2003). Seven dimensions of portability for language documentation and description. Language, 79:557–582. Burnard, L. and Bauman, S., editors (2007). TEI P5: Guidelines for Electronic Text Encoding and Interchange. Text Encoding Initiative Consortium. Dipper, S., Hinrichs, E., Schmidt, T., Wagner, A., and Witt, A. (2006). Sustainability of Linguistic Resources. In Hinrichs, E., Ide, N., Palmer, M., and Pustejovsky, J., editors, Proceedings of the LREC 2006 Satellite Workshop Merging and Layering Linguistic Information, pages 48–54, Genoa, Italy. Lehmberg, T., Chiarcos, C., Hinrichs, E., Rehm, G., and Witt, A. (2007a). Collecting Legally Relevant Metadata by Means of a Decision-Tree-Based Questionnaire System. In Schmidt, S., Siemens, R., Kumar, A., and Unsworth, J., editors, Digital Humanities 2007, pages 164–166, Urbana-Champaign, IL, USA. ACH, ALLC, Graduate School of Library and Information Science, University of Illinois, Urbana-Champaign. Lehmberg, T., Chiarcos, C., Rehm, G., and Witt, A. (2007b). Rechtsfragen bei der Nutzung und Weitergabe linguistischer Daten. In Rehm, G., Witt, A., and Lemnitzer, L., editors, Datenstrukturen für linguistische Ressourcen und ihre Anwendungen – Data Structures for Linguistic Resources and Applications: Proceedings of the Biennial GLDV Conference 2007, pages 93–102. Gunter Narr, Tübingen. Lehmberg, T., Rehm, G., Witt, A., and Zimmermann, F. (2008). Preserving Linguistic Resources: Licensing – Privacy Issues – Mashups. Library Trends. In print. Lehmberg, T. and Wörner, K. (2007). Annotation Standards. In Lüdeling, A. and Kytö, M., editors, Corpus Linguistics, Handbücher zur Sprach-und Kommunikationswissenschaft (HSK). de Gruyter, Berlin, New York. In press. Rehm, G., Eckart, R., and Chiarcos, C. (2007a). An OWL-and XQuery-Based Mechanism for the Retrieval of Linguistic Patterns from XML-Corpora. In Angelova, G., Bontcheva, K., Mitkov, R., Nicolov, N., and Nikolov, N., editors, International Conference Recent Advances in Natural Language Processing (RANLP 2007), pages 510–514, Borovets, Bulgaria. Rehm, G., Eckart, R., Chiarcos., C., Dellert, J. (2008a). Ontology-Based XQuery’ing of XML-Encoded Language Resources on Multiple Annotation Layer. In Proceedings of LREC 2008, Marrakech, Morocco. 
Rehm, G., Schonefeld, O., Witt, A., Lehmberg, T., Chiarcos, C., Bechara, H., Eishold, F., Evang, K., Leshtanska, M., Savkov, A., and Stark, M. (2008b). The Metadata-Database of a Next Generation Sustainability Web-Platform for Language Resources. In Proceedings of LREC 2008, Marrakech, Morocco. Rehm, G., Witt, A., Zinsmeister, H., and Dellert, J. (2007b). Corpus Masking: Legally Bypassing Licensing Restrictions for the Free Distribution of Text Collections. In Schmidt, S., Siemens, R., Kumar, A., and Unsworth, J., editors, Digital Humanities 2007, pages 166–170, Urbana-Champaign, IL, USA. ACH, ALLC, Graduate School of Library and Information Science, University of Illinois, Urbana-Champaign. Rehm, G., Witt, A., Zinsmeister, H., and Dellert, J. (2007c). Masking Treebanks for the Free Distribution of Linguistic Resources and Other Applications. In Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories (TLT 2007), number 1 in Northern European Association for Language Technology Proceedings Series, pages 127–138, Bergen, Norway. Schmidt, T., Chiarcos, C., Lehmberg, T., Rehm, G., Witt, A., and Hinrichs, E. (2006). Avoiding Data Graveyards: From Heterogeneous Data Collected in Multiple Research Projects to Sustainable Linguistic Resources. In Proceedings of the EMELD 2006 Workshop on Digital Language Documentation:Tools and Standards – The State of the Art, East Lansing, Michigan. Söhn, J.-P., Zinsmeister, H., and Rehm, G. (2008). Requirements of a User-Friendly, General-Purpose Corpus Query Interface. In Proceedings of LREC 2008 workshop on Sustainability of Language Resources and Tools for Natural Language Processing, Marrakech, Morocco. Trilsbeek, P. and Wittenburg, P. (2006). Archiving Challenges. In Gippert, J., Himmelmann, N. P., and Mosel, U., editors, Essentials of Language Documentation, pages 311–335. Mouton de Gruyter, Berlin, New York. _____________________________________________________________________________ 28 Digital Humanities 2008 _____________________________________________________________________________ Witt, A., Rehm, G., Lehmberg, T., and Hinrichs, E. (2007a). Mapping Multi-Rooted Trees from a Sustainable Exchange Format to TEI Feature Structures. In TEI@20: 20 Years of Supporting the Digital Humanities. The 20th Anniversary Text Encoding Initiative Consortium Members’ Meeting, University of Maryland, College Park. Witt, A., Schonefeld, O., Rehm, G., Khoo, J., and Evang, K. (2007b). On the Lossless Transformation of Single-File, Multi-Layer Annotations into Multi-Rooted Trees. In Usdin, B. T., editor, Proceedings of Extreme Markup Languages 2007, Montréal, Canada. Wörner, K., Witt, A., Rehm, G., and Dipper, S. (2006). Modelling Linguistic Data Structures. In Usdin, B. T., editor, Proceedings of Extreme Markup Languages 2006, Montréal, Canada. Zimmermann, F. and Lehmberg, T. (2007). Language Corpora – Copyright – Data Protection: The Legal Point of View. In Schmidt, S., Siemens, R., Kumar, A., and Unsworth, J., editors, Digital Humanities 2007, pages 162–164, Urbana-Champaign, IL, USA. ACH, ALLC, Graduate School of Library and Information Science, University of Illinois, Urbana-Champaign. Zinsmeister, H., Kübler, S., Hinrichs, E., and Witt, A. (2008). Linguistically Annotated Corpora: Quality Assurance, Reusability and Sustainability. In Lüdeling, A. and Kytö, M., editors, Corpus Linguistics, HSK. de Gruyter, Berlin etc. In print. 
ALLC SESSION: e-Science: New collaborations between information technology and the humanities

Speakers

David Robey d.j.b.robey@reading.ac.uk Arts and Humanities Research Council, UK
Stuart Dunn stuart.dunn@kcl.ac.uk King's College London, UK
Laszlo Hunyadi hunyadi@ling.arts.unideb.hu University of Debrecen, Hungary
Dino Buzzetti buzetti@philo.unibo.it University of Bologna, Italy

e-Science in the UK and elsewhere stands for a broad agenda as important for the humanities as it is for the natural sciences: to extend the application of advanced information technologies to develop new kinds of research across the whole range of academic disciplines, particularly through the use of internet-based resources. The aim of this session is to introduce some recent UK developments in this area, side by side with related developments in other parts of Europe.

• David Robey: Introduction
• Stuart Dunn: e-Science developments in the UK: temporal mapping and location-aware web technologies for the humanities
• Laszlo Hunyadi: Virtual Research Organizations for the Humanities in Europe: technological challenges, needs and opportunities
• Dino Buzzetti: Interfacing Biological and Textual studies in an e-Science Environment

_____________________________________________________________________________ 29 Digital Humanities 2008 _____________________________________________________________________________

Understanding TEI(s): A Presentation and Discussion Session

Chairs

Susan Schreibman sschreib@umd.edu University of Maryland, USA
Ray Siemens siemens@uvic.ca University of Victoria, Canada

Speakers will include

Peter Boot peter.boot@huygensinstituut.knaw.nl Huygens Institute, The Netherlands
Arianna Ciula arianna.ciula@kcl.ac.uk King's College London, UK
James Cummings James.Cummings@oucs.ox.ac.uk University of Oxford, UK
Kurt Gärtner gaertnek@staff.uni-marburg.de Universität Trier, Germany
Martin Holmes mholmes@uvic.ca University of Victoria, Canada
John Walsh jawalsh@indiana.edu Indiana University, USA

The Text Encoding Initiative (TEI) is a consortium which collectively develops and maintains a standard for the representation of texts in digital form. Its chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, primarily in the humanities, social sciences and linguistics. Since 1994, the TEI Guidelines have been widely used by libraries, museums, publishers, and individual scholars to present texts for online research, teaching, and preservation. In addition to the Guidelines themselves, the Consortium provides a variety of supporting resources, including resources for learning TEI <http://www.tei-c.org/Support/Learn/>, information on projects using the TEI <http://www.tei-c.org/Activities/Projects/>, TEI-related publications <http://www.tei-c.org/Support/Learn/>, and software <http://www.tei-c.org/Tools/> developed for or adapted to the TEI. The TEI Consortium is a non-profit membership organization composed of academic institutions, research projects, and individual scholars from around the world. For more information, see http://www.tei-c.org/.

This panel provides an opportunity beyond the project- and researcher-centred papers of the DH2008 conference for members of the TEI community to present on and discuss aspects of the TEI community itself, among them the nature and influence of Special Interest Groups (SIGs), recent trends in the integration of TEI with other schema and systems, issues of the annual conference and community support, and education and outreach now and in the future.

_____________________________________________________________________________ 30 Digital Humanities 2008 _____________________________________________________________________________

DH2008: ADHO Session 'Digital resources in humanities research: Evidence of value (2)'

Chair

Harold Short harold.short@kcl.ac.uk King's College London, UK

Panelists

David Hoover david.hoover@nyu.edu New York University, USA
Lorna Hughes lorna.hughes@kcl.ac.uk King's College London, UK
David Robey d.j.b.robey@reading.ac.uk Arts and Humanities Research Council, UK
John Unsworth unsworth@uiuc.edu University of Illinois, Urbana-Champaign, USA

This takes further the issues discussed at the ALLC session at DH2007. While most of us who do humanities computing need no convincing of its value, academic colleagues - including those on appointment and promotion panels - still need to be convinced, and even more so funders. If we want backing for the use and further development of digital resources, both data and processes, we need to collect more extensive concrete evidence of the ways in which they enable us to do research better, or to do research we would not otherwise be able to do, or to generate new knowledge in entirely new ways. Since the value of humanities research as a whole is qualitative, not quantitative, it is qualitative evidence in particular we should be looking for: digital resources providing the means not simply of doing research, but of doing excellent research. The DH2007 panel session discussed a wide range of general issues arising in the discussion of the value of humanities computing, both in terms of its impact and results, and in terms of the intrinsic structures and qualities of digital objects created for research purposes.

The present panel session takes this further on the one hand by presenting recent work in the UK that has systematically tried to capture the value of humanities computing support activities, of digital tools development projects, and in general of the impact of digital methods on research in the arts and humanities. Lorna Hughes and David Robey will discuss the results of the evaluation process at the end of the AHRC ICT Methods Network at King's College London, and of related work on a set of resource-development projects funded by the AHRC ICT in Arts and Humanities Research Programme.

John Unsworth and David Hoover will take a somewhat more anecdotal approach, and one that emphasizes North America rather than the UK. They will focus on a variety of ways of assessing and enhancing the value of digital humanities research in the areas of access, analysis, and advancing one's career. What kinds and levels of access to digital material have the most impact? What kinds of analysis and presentation, and what venues of publication or dissemination are most persuasive and effective? How does or can the exploitation of digital materials enhance the career of the (digital) humanist?

We hope that participants from other countries will contribute their points of view in the discussion.

_____________________________________________________________________________ 31 Digital Humanities 2008 _____________________________________________________________________________

The Building Blocks of the New Electronic Book

Ray Siemens siemens@uvic.ca University of Victoria, Canada
Claire Warwick c.warwick@ucl.ac.uk University College London, UK
Kirsten C. Uszkalo kirsten@uszkalo.com St.
Francis Xavier University, Canada Stan Ruecker sruecker@ualberta.ca University of Alberta, Canada Paper 1: A New Context for the Electronic Book Ray Siemens The power of the book has preoccupied readers since before the Renaissance, a crucial time during which the human record entered print and print, in turn, facilitated the radical, crucial social and societal changes that have shaped our world as we know it. It is with as much potential and as much complexity, that the human record now enters the digital world of electronic media. At this crucial time, however, we must admit to some awkward truths. The very best of our electronic books are still pale reflections of their print models, and the majority offer much less than a reflection might. Even the electronic document – the webpage mainstay populating the World Wide Web – does not yet afford the same basic functionality, versatility, and utility as does the printed page. Nevertheless, more than half of those living in developed countries make use of the computer and the internet to read newspaper pieces, magazine and journal articles, electronic copies of books, and other similar materials on a daily basis; the next generation of adults already recognises the electronic medium as their chief source of textual information; our knowledge repositories increasingly favour digital products over the print resources that have been their mainstay for centuries; and professionals who produce and convey textual information have as a chief priority activities associated with making such information available, electronically – even if they must do so in ways that do not yet meet the standards of functionality, interactivity, and fostering of intellectual culture that have evolved over 500 years of print publication. Why, then, are our electronic documents and books not living up to the promise realized in their print counterparts? What is it about the book that has made it so successful? How can we understand that success? How can we duplicate that success, in order to capitalize on economic and other advantages in the electronic realm? This introductory paper frames several responses to the above, centring on issues related to textual studies, reader studies, interface design, and information management, and framing the more detailed responses in the areas below. Paper 2: ‘Humanities scholars, research and reading, in physical and digital environments’ Claire Warwick It may initially seem strange to suggest that research on what seems to be such a traditional activity as reading might be a vital underpinning to the study of digital resources for the humanities. Yet, to design digital resources fit for research purposes, we must understand what scholars do in digital environments, what kind of resources they need, and what makes some scholars, particularly in the humanities, decline to use existing digital resources. One of the most important things that humanities scholars undertake in any environment is to read. Reading is the fundamental activity of humanities researchers, and yet we have little knowledge of how reading processes are modified or extended in new media environments. In the early to mid 1990s, many humanities scholars expressed excitement about the possibilities of electronic text, predicting that the experience of reading would change fundamentally (e.g., Bolter, 1991; Landow, 1992; Nunberg, 1996; Sutherland, 1997). 
Such prophetic writings, however, were not based on and rarely followed up by studies with readers, particularly readers in humanities settings. Through the last fifteen years critical interest within humanities circles with respect to reading has waned and little progress has been made in understanding how electronic textuality may affect reading practices, both of academic and non-academic readers. We are, however, beginning to understand the relationship between reading in print and online, and how connections may be made between the two (Blandford, Rimmer, & Warwick 2006), and how humanities scholars relate to information environments, by reading and information seeking. In this paper we will discuss our research to date, undertaken as part of the UCIS project on the users of digital libraries, and on the LAIRAH project, which studied levels of use of a variety of digital humanities resources. (http://www.ucl.ac.uk/slais/ research/circah/) Research has been based on a use-in- context approach, in which participants have been interviewed about their preferences for the use of digital resources in a setting which is as naturalistic as possible. We also report on the results of user workshops, and of one to one observations of the use of digital resources. As a result of these studies, we have found that humanities researchers have advanced information skills and mental models of their physical information environment, although they may differ in that they find these skills and models difficult _____________________________________________________________________________ 32 Digital Humanities 2008 _____________________________________________________________________________ to apply to the digital domain (Makri et al. 2007). Humanities researchers are aware of the available functions as well as the problems of digital environments, and are concerned with accuracy, selection methods, and ease of use (Warwick et al. 2007). They require information about the original item when materials are digitized and expect high-quality content: anything that makes a resource difficult to understand – a confusing name, a challenging interface, or data that must be downloaded – will deter them from using it (Warwick et al. 2006). It is therefore vital that such insights into the information behaviour of humanities scholars should be used to inform future studies of reading. It is also vital that this research is informed by the insights from other members of the wider research group, who work on textual studies and interface and resource design.The paper will therefore end with a short discussion of future plans for such collaboration. References Bolter, J. David. (1991). Writing space: the computer, hypertext, and the history of writing. Hillsdale, NJ: L. Erlbaum Associates. Blandford, A., Rimmer, J., & Warwick, C. (2006). Experiences of the library in the digital age. Paper presented at 3rd International Conference on Cultural Covergence and Digital Technology, Athens, Greece. Landow, G. P. (1992). Hypertext: the convergence of contemporary critical theory and technology. Baltimore, MD: Johns Hopkins University Press, 1992. Makri, S., Blandford, A., Buchanan, G., Gow, J., Rimmer, J., & Warwick, C. (2007). A library or just another information resource? A case study of users’ mental models of traditional and digital libraries. Journal of the American Society of Information Science and Technology, 58(3), 433-445. Nunberg, G (ed.) (1996) The Future of the Book. Berkeley, University of California Press. 
Sutherland, K. (Ed) (1997) Electronic Text: Investigations in Method and Theory. Oxford, Clarendon Press Warwick, C., Terras, M., Huntington, P., & Pappa, N. (2007) “If you build it will they come?” The LAIRAH study: Quantifying the use of online resources in the Arts and Humanities through statistical analysis of user log data. Literary and Linguistic Computing, forthcoming Paper 3: A Book is not a Display: A Theoretical Evolution of the E-Book Reader Kirsten C. Uszkalo and Stan Ruecker In our part of this panel session, we will discuss the design of the physical electronic book—the reading device—from a user perspective. There has been a tendency on the part of producers of information to disassociate the information from its media. This makes sense in a digital age: there are inefficiencies in production that can be made more efficient by remediation.Additionally, there is little intrinsic justification for preferring one medium over another, unless one wishes to take into account the situated context of use. Designers, however, insofar as they are distinct from producers of information, have had a longstanding tendency to become absorbed in the situated context of use. When the conventions of HTML were determined in such a way that they discounted formatting, and for very good reasons of cross-platform compatibility, designers began the uphill battle of changing the conventions, because they felt that formatting, and not just content, was important. For example, Travis Albers says of his recent project Bookglutton. com: “BookGlutton is working to facilitate adoption of on-line reading. Book design is an important aspect of the reader, and it incorporates design elements, like dynamic dropcaps.” The issue at stake with existing electronic book readers is their function as an interface, which can, according to Gui Bonsiepe, create the “possibilities for effective action or obstruct them” (62). Interface issues like eye strain, contrast, and resolution are being addressed through the use of E-Ink’s electronic paper in Sony’s 2007 / 2007 Sony E Book, Bookeen, iRex Technologies, and Amazon’s Kindle (2008) electronic book readers. Samsung and LG Philips are addressing object size of ebook readers with 14 inch lightweight epaper formats; and Polymer Vision’s Readius, “a mobile device” is a “flexible 5-inch e-paper display that unfurls like a scroll.” However, these formats lack the intuitive, familiar feeling of the book. Acknowledging that ebook proprietary software issues, copyright, and cost might keep people from investing in a reading device in order to be able to read a book, and noting the continued superiority of the paper book in both form and function over the electronic book reader, we will look to the range of design issues which continue to plague the potential viable sales for the electronic book reader. _____________________________________________________________________________ 33 Digital Humanities 2008 _____________________________________________________________________________ References Balas, Janet L. “Of iPhones and Ebooks: Will They Get Together?” Computers in Libraries. Westport: Nov/Dec 2007. Vol. 27, Iss. 10; pg. 35, 4 pgs Bina. “DON’T Upgrade the REB 1100!!,” Published November 6, 2004. Accessed November 17, 2007. http:// www.amazon.com/RCA-REB1100-eBook-Reader/dp/ B00005T3UH Geist, Michael. “Music Publisher’s Takedown Strikes The Wrong Chord” Published Tuesday October 30, 2007. Accessed November 17, 2007 http://www.michaelgeist. 
ca/content/view/2336/135/
Grammenos, Dimitris, et al. "Teaching and learning: Dual educational electronic textbooks: the starlight platform." Proceedings of the 9th International ACM SIGACCESS Conference on Computers and Accessibility (Assets '07), October 2007.
Hodge, Steve, et al. "ThinSight: versatile multi-touch sensing for thin form-factor displays." Symposium on User Interface Software and Technology. Proceedings of the 20th annual ACM symposium, Newport, Rhode Island, USA, 2007, pp. 259-268.
Farber, Dan. "Sony's misguided e-book." Published February 22, 2006. Accessed November 17, 2007. http://blogs.zdnet.com/BTL/?p=2619
Forest, Brady. "Illiad, a new ebook reader." Published September 21, 2006. Accessed November 17, 2007. http://radar.oreilly.com/archives/2006/09/illiad_a_new_ebook_reader.html
Mangalindan, Mylene. "Amazon to Debut E-Book Reader." Wall Street Journal. Published November 16, 2007; Page B2.
Morrison, Chris. "10 Taking a Page Out of the E-Book." Business 2.0. San Francisco: Oct 2007. Vol. 8, Iss. 9; pg. 34.
Oswald, Ed. "Sony Updates Electronic Book Reader." Published October 2, 2007. Accessed November 17, 2007. <http://www.betanews.com/article/Sony_Updates_Electronic_Book_Reader/119135092>
Stone, Brad. "Envisioning the Next Chapter for Electronic Books." Published September 6, 2007. Accessed November 17, 2007. http://www.nytimes.com/2007/09/06/technology/06amazon.html
SDH/SEMI panel: Text Analysis Developers' Alliance (TADA) and T-REX
Organiser: Stéfan Sinclair sgsinclair@gmail.com McMaster University, Canada
Chair: Ray Siemens siemens@uvic.ca University of Victoria, Canada
Participants:
Stéfan Sinclair sgsinclair@gmail.com McMaster University, Canada
Matt Jockers mjockers@stanford.edu Stanford University, USA
Susan Schreibman sschreib@umd.edu University of Maryland, USA
Patrick Juola juola@mathcs.duq.edu Duquesne University, USA
David Hoover david.hoover@nyu.edu New York University, USA
Jean-Guy Meunier meunier.jean-guy@uqam.ca Université du Québec à Montréal, Canada
Dominic Forest dominic.forest@umontreal.ca University of Montreal, Canada
The Text Analysis Developers' Alliance (TADA) is an informal affiliation of designers, developers, and users of text analysis tools in the digital humanities. The TADA wiki (http://tada.mcmaster.ca/) has steadily grown since the group's first meeting three years ago, and now contains a wide range of useful documents, including general information about text analysis, pedagogically-oriented text analysis recipes, and content from several projects engaged in open research (publicly documenting themselves). Most recently TADA has announced its inaugural T-REX event (http://tada.mcmaster.ca/trex/), a community initiative with the following objectives:
• to identify topics of shared interest for further development in text analysis
• to evaluate as a community ideas and tools that emerge from the topics
• to present results from this process as research in various venues
This panel will provide a more detailed account of the history and activities of TADA, as well as a preliminary account of T-REX from the participants themselves, where possible.
Agora.Techno.Phobia.Philia2: Feminist Critical Inquiry, Knowledge Building, Digital Humanities Martha Nell Smith mnsmith@umd.edu University of Maryland, USA Susan Brown sbrown@uoguelph.ca University of Guelph, Canada Laura Mandell mandellc@muohio.edu Miami University, USA Katie King katking@umd.edu University of Maryland, USA Marilee Lindemann mlindema@umd.edu University of Maryland, USA Rebecca Krefting beckortee@starpower.net University of Maryland, USA Amelia S. Wong awong22@umd.edu University of Maryland, USA In the later twentieth century, the humanities were transformed by the rise of gender studies and related critical movements that investigate the impact of socialized power distribution on knowledge formations and institutions. Throughout the humanities, the very research questions posed have been fundamentally altered by the rise of feminist and, more recently, critical race, queer, and related social justice-oriented critiques. These critical perspectives, as speakers in last year’s DH2007 panel “Agora.Techno.Phobia.Philia: Gender, Knowledge Building, and Digital Media” amply demonstrated, have much to contribute not only to thinking about women in relation to information and communication technologies but also to advancing the work and efficacies of information and communications technologies. The session we propose extends and deepens the critical inquiries posed in last year’s session. Though Carolyn Guertin will not be able to join us in Finland this year, we each will take into account the telling survey (available upon request) she presented of virtual space and concomitant debates devoted to women and information technology, as well as her compelling analysis of the stakes involved for female participants in such debates. Important also for our considerations are the special issue of Frontiers devoted to gender, race, and information technology (http:// _____________________________________________________________________________ 35 Digital Humanities 2008 _____________________________________________________________________________ muse.jhu.edu/journals/frontiers/toc/fro26.1.html); recent books by feminist thinkers such as Isabel Zorn, J. McGrath Cohoon and William Aspray, Lucy Suchman; and the most recent issue of Vectors devoted to Difference (Fall 2007; http:// www.vectorsjournal.org/index.php?page=6%7C1). Though these publications demonstrate keen and widespread interest in messy and ambiguous questions of diversity and technology and new media, such work and debates seem oddly absent from or at least underappreciated by the digital humanities (DH) community itself. It’s not that the Digital Humanities community feels hostile to women or to feminist work per se. Women have historically been and currently are visible, audible, and active in ADHO organizations, though not in equal numbers as men. The language used in papers is generally gender-aware. Conference programs regularly include work about feminist projects, that is to say ones that focus on feminist content (such as the interdisciplinary The Orlando Project, The Poetess Archive, Women Writers Project, Dickinson Electronic Archives). 
Though other DH projects and initiatives are inter- or transdisciplinary, their interdisciplinarities have tended not to include feminist elements, presumably because those feminist lines of inquiry are assumed to be discipline- or content-based (applicable only to projects such as those mentioned above), hence are largely irrelevant to what the digital humanities in general are about. Thus DH conference exchanges tend to have little to do with gender. In other words, DH debates over methodological and theoretical approaches, or over the substance of digital humanities as a field, have not been informed by the research questions fundamentally altered by feminist, critical race, queer, and related social justiceoriented critiques, the very critiques that have transformed the humanities as a whole. Why? The timing of the rise of the “field” perhaps?-much of the institutionalization and expansion, if not the foundation, of digital humanities took place from the late 20th century onwards. As recent presentations at major DH centers make clear (Smith @ MITH in October 2007, for example), there is a sense among some in the DH community that technologies are somehow neutral, notwithstanding arguments by Andrew Feenberg, Lucy Suchman, and others in our bibliography that technology and its innovations inevitably incorporate patterns of domination and inequity or more generalized social power distributions. Yet if we have learned one thing from critical theory, it is that interrogating the naturalized assumptions that inform how we make sense of the world is crucial in order to produce knowledge. Important for our panel discussion (and for the companion poster session we hope will be accepted as part of this proposal) is feminist materialist inquiry, which urges us to look for symptomatic gaps and silences in intellectual inquiry and exchange and then analyze their meanings in order to reinflect and thereby enrich the consequent knowledgebuilding discourses. So what happens if we try to probe this silence on gender in digital humanities discourse, not only in regard to institutions and process, but in regard to defining and framing DH debates? Building on the work of Lucy Suchman and the questions of her collaborators on the panel, Martha Nell Smith will show how questions basic to feminist critical inquiry can advance our digital work (not just that of projects featuring feminist content): How do items of knowledge, organizations, working groups come into being? Who made them? For what purposes? Whose work is visible, what is happening when only certain actors and associated achievements come into public view? What happens when social orders, including intellectual and social framings, are assumed to be objective features of social life (i.e., what happens when assumptions of objectivity are uninformed by ethnomethodology, which reminds us that social order is illusory and that social structures appear to be orderly but are in reality potentially chaotic)? Doing so, Smith will analyze the consequences of the politics of ordering within disambiguating binary divisions (Subject and object; Human and nonhuman; Nature and culture; Old and new) and the consequent constructions and politics of technology:Where is agency located? Whose agencies matter? What counts as innovation: why are tools valorized? 
How might the mundane forms of inventive, yet taken for granted labor, necessary (indeed VITAL) to the success of complex sociotechnical arrangements, be more effectively recognized and valued for the crucial work they do? Smith will argue that digital humanities innovation needs to be sociological as well as technological if digital humanities is to have a transformative impact on traditional scholarly practices. Susan Brown will open up the question-what happens if we try to probe this silence on gender in digital humanties discoursein a preliminary way by focusing on a couple of terms that circulate within digital humanities discourse in relation to what the digital humanities are (not) about: delivery and service. Delivery and interface work are intriguingly marginal rather than central to DH concerns, despite what we know about the impact of usability on system success or failure.Willard McCarty regards the term “delivery” as metaphorically freighted with connotations of knowledge commodification and mug-and-jug pedagogy (6). However, informed by feminist theory and the history of human birthing, we might alternatively mobilize the less stable connotations of delivery as “being delivered of, or act of bringing forth, offspring,” which offers a model open to a range of agents and participants, in which the process and mode of delivery have profound impacts on what is delivered. The history of the forceps suggests a provocative lens on the function of tools in processes of professionalization and disciplinary formation.An emphasis on delivery from conception through to usability studies of fully developed interfaces willif it takes seriously the culturally and historically grounded analyses of technology design and mobilization modeled by Lucy Suchman-go a considerable distance towards breaking down the oppositions between designer and user, producer and consumer, technologist and humanist that prevent digital humanities from having transformative impacts on traditional scholarly practices. This sense of delivery foregrounds boundaries as unstable and permeable, and boundary issues, as Donna Haraway was among the first to argue, have everything to do with the highly politicized-and gendered-category of the _____________________________________________________________________________ 36 Digital Humanities 2008 _____________________________________________________________________________ human, the subject/object of humanist knowledge production. The relationships among service (often married to the term “support”), disciplinarity, and professionalization within debates over the nature and identity of humanities computing are similarly important for reimagining how DH work might be advanced and might have a greater impact on the humanities as a whole. Service, a feminized concept, recurs as a danger to the establishment of legitimacy (Rommel; Unsworth). Sliding, as Geoffrey Rockwell points out, easily into the unequivocally denigrating “servile,” service too challenges us to be aware of how institutional power still circulates according to categories and hierarchies that embed social power relations. As Susan Leigh Star has argued, marginality, liminality, or hybridity, multiple membership that crosses identities or communities, provide valuable vantage points for engagement with shifting technologies (Star 50-53). 
These two liminal terms may open up digital humanities debates to new questions and new prospects, and Brown will offer examples of some ways in which this might be done in order to enrich DH work in general. By focusing on the terms gynesis and ontology, Laura Mandell examines the relationship between concepts in traditional epistemological discourses and those concepts as they appear in discussions about and implementations of the semantic web: hierarchy, ontology, individual, and class. First, she will look at traditional meanings of these terms, performing philological analysis of their origin and province as well as locating them in their disciplinary silos – in the manner of Wikipedia's disambiguation pages. Next, she will look at how the terms are defined in discussions of the emerging semantic web, and also at how they function practically in XML Schema, database design, and OWL, the Web Ontology Language. Much work has been done by feminist theorists and philosophers, from Luce Irigaray to, more recently, Liz Stanley and Sue Wise, in critiquing the sexism implicit in concepts of hierarchy, being, and individuality; Mary Poovey has tracked the emergence of modern notions of class in her book about the history of the fact. Mandell will extend these foundational critiques, asking whether any of the insights achieved by that work apply to the semantic web. Of particular concern for Mandell is the category of error. Simone de Beauvoir's Second Sex first brought to light the notion of woman as Other, as "error" to the man, and Julia Kristeva (following anthropologist Mary Douglas) finds female agency lurking in the dirt or ambiguity of any logical system. If computational and generational languages such as those producing the semantic web cannot tolerate ambiguity and error, what will be repressed? Does the syntax of natural language allow for women's subjectivity in a way that ontological relationships might not? What are the opportunities for adding relationships to these systems that capitalize upon the logical errors that philosopher Luce Irigaray has designated as necessary if "The Human are two"? Rounding out this panel will be two respondents to these three 12-15 minute presentations, experts in feminist and queer theory and practice, both of whom have expressed keen interest in DH through the evolving practices of their work in knowledge-building and pedagogy. Professors Katie King, who has worked for the past two decades on writing technologies and their implications for knowledge production, and Marilee Lindemann, who is a leading queer studies theorist and a blogger who has been exploring new forms of scholarly writing in her creative nonfiction, Roxie's World (http://roxiesworld.blogspot.com/), will respond to the questions posed by Smith, Brown, and Mandell, formulating questions that aim to engage the audience in discourse designed to advance both our practices and our theories about what we are doing, where we might be going, and how augmenting our modes and designs for inquiry might improve both our day-to-day DH work and its influence on humanities knowledge production. Their questions will be posed ahead of time (by the end of May) on the website we will devote to this session (http://www.mith.umd.edu/mnsmith/DH2008), as will "think piece" interactive essays contributed by all of the participants (we are most likely to use a wiki).
As a result of ongoing work with feminist graduate students interested in new media, its possibilities realized and unrealized, we propose extending this panel asynchronously so that current dissertation research can be featured as case studies more than suitable for testing some of the hypotheses, exploring the questions, and reflecting upon the issues raised by the panel presentations. Thus we propose two poster sessions (or one with two components) by University of Maryland graduate students: Rebecca Krefting, "Because HTML Text is Still a Text: The (False) Promise of Digital Objectivity," and Amelia Wong, "Everyone a Curator?: The Potential and Limitations of Web 2.0 for New Museology."
Papers
Unicode 5.0 and 5.1 and Digital Humanities Projects
Deborah Winthrop Anderson
dwanders@sonic.net
UC Berkeley, USA
In theory, digital humanities projects should rely on standards for text and character encoding. For character encoding, the standard recommended by TEI P5 (TEI Consortium, eds.
Guidelines for Electronic Text Encoding and Interchange [last modified: 03 Feb 2008], http://www.tei-c.org/P5/) is the Unicode Standard (http://www.unicode.org/standard/ standard.html). The choices made by digital projects in character encoding can be critical, as they impact text analysis and language processing, as well as the creation, storage, and retrieval of such textual digital resources. This talk will discuss new characters and important features of Unicode 5.0 and 5.1 that could impact digital humanities projects, discuss the process of proposing characters into Unicode, and provide the theoretical underpinnings for acceptance of new characters by the standards committees. It will also give specific case studies from recent Unicode proposals in which certain characters were not accepted, relaying the discussion in the standards committees on why they were not approved. This latter topic is important, because decisions made by the standards committees ultimately will affect text encoding. For those characters not in Unicode, the P5 version of the TEI Guidelines deftly describes what digital projects should do in Chapter 5 (TEI Consortium, eds. “Representation of Nonstandard Characters and Glyphs,” Guidelines for Electronic Text Encoding and Interchange [last modified: 03 Feb 2008], http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ WD.html [accessed: 24 March 2008]), but one needs to be aware of the new characters that are in the standards approval process. The presentation will briefly discuss where to go to look for the new characters on public websites, which are “in the pipeline.” The Unicode Standard will, with Unicode 5.1, have over 100,000 characters encoded, and proposals are underway for several unencoded historic and modern minority scripts, many through the Script Encoding Initiative at UC Berkeley (http://www.linguistics.berkeley.edu/sei/alpha-script-list.html). Reviewing the glyphs, names, and character properties for this large number of characters is difficult. Assistance from the academic world is sought for (a) authoring and review of current proposals of unencoded character and scripts, and (b) proofing the beta versions of Unicode. With the participation of digital humanists, this character encoding standard can be made a reliable and useful standard for such projects. The release of Unicode 5.0 in July 2007 has meant that an additional 1,369 new characters have been added to the standard, and Unicode 5.1, due to be released in April 2008, will add 1,624 more (http://www.unicode.org/versions/ Unicode5.1.0/) In order to create projects that take advantage of what Unicode and Unicode-compliant software offers, one must be kept abreast of developments in this standard and make appropriate changes to fonts and documents as needed. 
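A practical consequence of the point above is that projects need a way to notice when their files contain code points that the locally installed tools do not yet know about. The following is a minimal sketch of such a check using Python's unicodedata module; the file name and the decision to also flag private-use characters (a common stopgap for unencoded letters) are assumptions for illustration, not part of the abstract.

```python
# Minimal sketch: flag code points that are unassigned in the Unicode
# version known to the local Python build, plus private-use characters
# that projects often employ while awaiting formal encoding.
# The file name "edition.txt" is an illustrative assumption.
import unicodedata

print("Checking against Unicode", unicodedata.unidata_version)

with open("edition.txt", encoding="utf-8") as f:
    text = f.read()

for position, char in enumerate(text):
    category = unicodedata.category(char)
    if category == "Cn":        # unassigned in this Unicode version
        print(f"offset {position}: U+{ord(char):04X} is unassigned")
    elif category == "Co":      # private-use area
        print(f"offset {position}: U+{ord(char):04X} is a private-use character")
```

A report like this only says what the local Unicode data knows; characters newly added in 5.1 will show up as unassigned until fonts and libraries are updated, which is exactly the kind of maintenance the abstract recommends.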
For projects involving medieval and historic texts, for example, the release of 5.1 will include a significant number of European medieval letters, as well as new Greek and Latin epigraphic letters, editorial brackets and half-brackets, Coptic combining marks, Roman weights and measures and coin symbols, Old Cyrillic letters and Old Slavonic combining letters. The Menota project (http://www.menota.org/guidelines-2/convertors/convert_2-0-b.page), EMELD's "School of Best Practice" (http://linguistlist.org/emeld/school/classroom/conversion/index.html), and SIL's tools (http://scripts.sil.org/Conversion) all provide samples of conversion methods for upgrading digital projects to include new Unicode characters.
Since Unicode is the accepted standard for character encoding, any critical assessment of Unicode made to the body in charge of Unicode, the Unicode Technical Committee, is generally limited to comments on whether a given character is missing in Unicode or – if proposed or currently included in Unicode – critiques of a character's glyph and name, as well as its line-breaking properties and sorting position. In Chapter 5 of the TEI P5 Guidelines, mention is made of character properties, but it does not discuss line-breaking or sorting, which are now two components of Unicode proposals and are discussed in annexes and standards on the Unicode Consortium website (Unicode Standard Annex #14, "Line Breaking Properties," and Unicode Technical Standard #10, "Unicode Collation Algorithm," both accessible from www.unicode.org). Users should pay close attention to these two features, for an incorrect assignment can account for peculiar layout and sorting features in software. Comments on missing characters, incorrect glyphs or names, and properties should all be directed to the Unicode online contact page (http://www.unicode.org/reporting.html). It is recommended that an addition to Chapter 5 of P5 be made regarding word-breaking and collation when defining new characters.
Variation of Style: Diachronic Aspect
Vadim Andreev
smol.an@mail.ru
Smolensk State University, Russian Federation
Introduction
Among works devoted to the quantitative study of style, an approach prevails which can be conventionally called synchronic. The synchronic approach is aimed at solving various classification problems (including those of attribution), making use of average (mean) values of characteristics which reflect the style of the whole creative activity of an author. This approach is based on the assumption that the features of an individual style do not change during a lifetime, or vary in time so little that the changes can be disregarded as linguistically irrelevant. This assumption can be tested in experiments organised within a diachronic approach, whose purpose is to compare the linguistic properties of texts written by the same author at different periods of his life. This paper presents the results of such a diachronic study of the individual style of the famous American romantic poet E. A. Poe. The study was aimed at finding out whether there were linguistically relevant differences in the style of the poet at various periods of his creative activity and, if so, at revealing linguistic markers for the transition from one period to the other.
Material
The material includes iambic lyrics published by Poe in his four collections of poems. Lyrics were chosen because this genre expresses in the most vivid way the essential characteristics of a poet. In order to achieve a common basis for the comparison only iambic texts were taken; they usually did not exceed 60 lines. It should be noted that iamb was used by Poe in most of his verses. Sonnets were not taken for analysis because they possess a specific structural organization. Poe's life is divided into three periods: (1) from Poe's first attempts to write verses, approximately in 1824, till 1829, (2) from 1830 till 1835, and (3) from 1836 till 1849.
Characteristics
For the analysis 27 characteristics were taken. They include morphological and syntactic parameters. Morphological characteristics are formulated in terms of traditional morphological classes (noun, verb, adjective, adverb and pronoun). We counted how many times each of them occurs in the first and the final strong (predominantly stressed) syllabic positions – ictuses. Most of the syntactic characteristics are based on the use of traditional notions of the members of the sentence (subject, predicate, object, adverbial modifier) in the first and the final strong positions in poems. Other syntactic parameters are the number of clauses in (a) complex and (b) compound sentences. There are also several characteristics which represent what can be called poetical syntax.
They are the number of enjambements, the number of lines divided by syntactic pauses and the number of lines, ending in exclamation or question marks. Enjambement takes place when a clause is continued on the next line (And what is not a dream by day / To him whose eyes are cast / On things around him <…>). Pause is a break in a line, caused by a subordinate clause or another sentence (I feel ye now – I feel ye in your strength – <…>). The values of the characteristics, which were obtained as a result of the analysis of lyrics, were normalised over the size of these texts in lines. Method One of multivariate methods of statistical analyses – discriminant analysis – was used. This method has been successfully used in the study of literary texts for authorship detection (Stamatatos, Fakatakis and Kokkinakis 2001; Baayen, Van Halteren, and Tweedie 1996, etc.), genre differentiation (Karlgen, Cutting 1994; Minori Murata 2000, etc.), gender categorization (Koppel et al. 2002; Olsen 2005), etc. Discriminant analysis is a procedure whose purpose is to find characteristics, discriminating between naturally occurring (or a priori formed) classes, and to classify into these classes separate (unique) cases which are often doubtful and “borderline”. For this purpose linear functions are calculated in such a way as to provide the best differentiation between the classes. The variables of these functions are characteristics of objects, relevant for discrimination. Judging by the coefficients of these variables we can single out the parameters which possess maximum discriminating force. Besides, the procedure enables us to test the statistical significance of the obtained results (Klecka, 1989). In this paper discriminant analysis is used to find out if there is any difference between groups of texts written during Periods 1–3, reveal characteristics differentiating these text groups and establish their discriminating force. Results It would be natural to expect that due to Poe’s relatively short period of creative activity (his first collection of poems was published in 1827, his last collection – in 1845) his individual style does not vary much, if at all. Nevertheless the results show that there are clearly marked linguistic differences between his texts _____________________________________________________________________________ 42 Digital Humanities 2008 _____________________________________________________________________________ written during these three periods. Out of 27 characteristics, used in the analysis, 14 proved to possess discriminating force, distinguishing between the verse texts of different periods of the author’s life. The strongest discriminating force was observed in morphological characteristics of words both in the first and final strong positions and syntactic characteristics of the initial part of verse lines.These parameters may be used for automatic classification of Poe’s lyrics into three groups corresponding to three periods of his creative activity with 100% correctness. The transition from the first to the second period is mainly characterised by changes in the number of verbs, nouns and pronouns in the first and the last strong positions, as well as in the number of subordinate clauses in complex sentences, words in the function of adverbial modifier in the initial position in the line. 
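The discriminant-analysis workflow described under Method above can be sketched, in outline, as follows. This is a generic illustration with scikit-learn rather than the authors' actual procedure, and the feature matrix is invented; in the study it would hold the 27 line-normalised counts per poem, with the creative periods 1-3 as class labels.

```python
# Illustrative sketch of discriminant analysis over stylometric features:
# rows are poems, columns are normalised counts (per line of verse),
# labels are the periods of composition. All numbers here are invented.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

X = np.array([                     # shape: (n_poems, n_features), truncated to 3 features
    [0.42, 0.10, 0.05],
    [0.40, 0.12, 0.04],
    [0.31, 0.20, 0.09],
    [0.29, 0.22, 0.08],
    [0.18, 0.30, 0.15],
    [0.20, 0.28, 0.14],
])
y = np.array([1, 1, 2, 2, 3, 3])   # period 1, 2 or 3

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# Coefficients with the largest magnitude point to the features with the
# strongest discriminating force between the periods.
print(lda.coef_)
print(cross_val_score(lda, X, y, cv=2))  # rough estimate of classification accuracy
```

With the full 27-feature matrix, inspecting the coefficients in this way is what singles out the period markers reported in the Results.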
The development in Poe’s style from the second to the third period is also marked by changes in the number of morphological classes of words in the initial and final strong positions of the line (nouns, adverbs and pronouns). It should be stressed that these changes reflect general tendencies of variation of frequencies of certain elements and are not present in all the texts. In the following examples the shift of verbs from the final part of the line, which is characteristic of the first period, to the initial strong position of the line (i.e. second syllable) in the second period is observed. Period 1 But when within thy waves she looks – Which glistens then, and trembles – Why, then, the prettiest of brooks On the whole the results show that there are certain linguistic features which reflect the changes in the style of E.A.Poe. Among important period markers are part of speech characteristics and several syntactic parameters. Bibliography Baayen, R.H.,Van Halteren, H., and Tweedie, F. (1996) Outside the cave of shadows: Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 11: 121–131. Karlgen, J., Cutting, D. (1994) Recognizing text genres with simple metrics using discriminant analysis. Proceedings of COLING 94, Kyoto: 1071–1075. Klecka, W.R. (1989). Faktornyj, diskriminantnyj i klasternyj analiz. [Factor, discriminant and cluster analysis]. Moscow: Finansi i statistica. Koppel, M, Argamon, S., and Shimoni, A.R. (2002) Automatically Categorizing Written Texts by Author Gender. Literary & Linguistic Computing, 17: 401–412. Murata, M. (2000) Identify a Text’s Genre by Multivariate Analysis – Using Selected Conjunctive Words and Particlephrases. Proceedings of the Institute of Statistical Mathematics, 48: 311–326. Olsen, M. (2005) Écriture Féminine: Searching for an Indefinable Practice? Literary & Linguistic Computing, 20: 147–164. Stamatatos, E., Fakatakis, N., & Kokkinakis, G. (2001). Computer-Based Authorship Attribution Without Lexical Measures. Computers and the Humanities, 35: 193–214. Her worshipper resembles – For in my heart – as in thy streem – Her image deeply lies <...> (To the River) Period 2 You know the most enormous flower – That rose – <...> I tore it from its pride of place And shook it into pieces <...> (Fairy Land) _____________________________________________________________________________ 43 Digital Humanities 2008 _____________________________________________________________________________ Exploring Historical Image Collections with Collaborative Faceted Classification Georges Arnaout garna001@odu.edu Old Dominion University, USA Kurt Maly maly@cs.odu.edu Old Dominion University, USA Milena Mektesheva mmekt001@odu.edu Old Dominion University, USA Harris Wu hwu@odu.edu Old Dominion University, USA Mohammad Zubair zubair@cs.odu.edu Old Dominion University, USA, The US Government Photos and Graphics Collection include some of the nation’s most precious historical documents. However the current federation is not effective for exploration. We propose an architecture that enables users to collaboratively construct a faceted classification for this historical image collection, or any other large online multimedia collections. We have implemented a prototype for the American Political History multimedia collection from usa.gov, with a collaborative faceted classification interface. In addition, the proposed architecture includes automated document classification and facet schema enrichment techniques. 
Introduction It is difficult to explore a large historical multimedia humanities collection without a classification scheme. Legacy items often lack textual description or other forms of metadata, which makes search very difficult. One common approach is to have librarians classify the documents in the collection. This approach is often time or cost prohibitive, especially for large, growing collections. Furthermore, the librarian approach cannot reflect diverse and ever-changing needs and perspectives of users.As Sir Tim Berners-Lee commented:“the exciting thing [about Web] is serendipitous reuse of data: one person puts data up there for one thing, and another person uses it another way.” Recent social tagging systems such as del. icio.us permit individuals to assign free-form keywords (tags) to any documents in a collection. In other words, users can contribute metadata. These tagging systems, however, suffer from low quality of tags and lack of navigable structures. The system we are developing improves access to a large multimedia collection by supporting users collaboratively build a faceted classification. Such a collaborative approach supports diverse and evolving user needs and perspectives. Faceted classification has been shown to be effective for exploration and discovery in large collections [1]. Compared to search, it allows for recognition of category names instead of recalling of query keywords. Faceted classification consists of two components: the facet schema containing facets and categories, and the association between each document and the categories in the facet schema. Our system allows users to collaboratively 1) evolve a schema with facets and categories, and 2) to classify documents into this schema. Through users’ manual efforts and aided by the system’s automated efforts, a faceted classification evolves with the growing collection, the expanding user base, and the shifting user interests. Our fundamental belief is that a large, diverse group of people (students, teachers, etc.) can do better than a small team of librarians in classifying and enriching a large multimedia collection. Related Research Our research builds upon popular wiki and social tagging systems. Below we discuss several research projects closest to ours in spirit. The Flamenco project [1] has developed a good browsing interface based on faceted classification, and has gone through extensive evaluation with digital humanities collections such as the fine art images at the museums in San Francisco. Flamenco, however, is a “read-only” system. The facet schema is predefined, and the classification is pre-loaded. Users will not be able to change the way the documents are classified. The Facetag project [2] guides users’ tagging by presenting a predetermined facet schema to users. While users participate in classifying the documents, the predetermined facet schema forces users to classify the documents from the system’s perspective. The rigid schema is insufficient in supporting diverse user perspectives. A few recent projects [4, 7] attempt to create classification schemas from tags collected from social tagging systems. So far these projects have generated only single hierarchies, instead of multiple hierarchies as in faceted schemas. Also just as any other data mining systems, these automatic classification approaches suffers from quality problems. 
So far, no one has combined user efforts and automated techniques to build a faceted classification – both to build the schema and to classify documents into it – in a collaborative and interactive manner.
Architecture and Prototype Implementation
The architecture of our system is shown in Figure 1. Users can not only tag (assign free-form keywords to) documents but also collaboratively build a faceted classification in a wiki fashion. Utilizing the metadata created by users' tagging efforts and harvested from other sources, the system helps improve the classification. We focus on three novel features: 1) to allow users to collaboratively build and maintain a faceted classification, 2) to systematically enrich the user-created facet schema, and 3) to automatically classify documents into the evolving facet schema.
Figure 1. System Architecture
We have developed a Web-based interface that allows users to create and edit facets/categories similar to managing directories in the Microsoft File Explorer. Simply by clicking and dragging documents into faceted categories, users can classify (or re-classify) historic documents. All the files and documents are stored in a MySQL database. For automatic classification, we use a support vector machine method [5] utilizing users' manual classification as training input. For systematic facet enrichment, we are exploring methods that create new faceted categories from free-form tags based on a statistical co-occurrence model [6] and also WordNet [8]. Note that the architecture has an open design so that it can be integrated with existing websites or content management systems. As such the system can be readily deployed to enrich existing digital humanities collections.
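The support-vector-machine step can be sketched roughly as follows. This is an illustrative reconstruction, not the project's actual code: the item texts, facet categories, and the use of scikit-learn's TF-IDF and LinearSVC are assumptions standing in for whatever metadata and SVM implementation the system actually uses.

```python
# Illustrative sketch: train an SVM on items that users have already
# dragged into categories of one facet (e.g. "Topics"), then suggest
# categories for items that are still unclassified.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# (item text, category) pairs harvested from users' manual classification
labeled = [
    ("portrait of george washington first president", "Presidents"),
    ("the white house north facade photograph", "Buildings"),
    ("abraham lincoln seated portrait", "Presidents"),
    ("united states capitol under construction", "Buildings"),
]
unlabeled = ["theodore roosevelt on horseback"]

texts, categories = zip(*labeled)

# TF-IDF features plus a linear SVM, the standard text-categorization setup
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(texts, categories)

for text, guess in zip(unlabeled, model.predict(unlabeled)):
    print(f"{text!r} -> suggested category: {guess}")
```

The suggestions produced this way would still be subject to the same user review (and "hardening") discussed below, since the training data is itself user-contributed.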
We have deployed a prototype on the American Political History (APH) sub-collection (http://teachpol.tcnj.edu/amer_pol_hist) of the US Government Photos and Graphics Collection, a federated collection with millions of images (http://www.usa.gov/Topics/Graphics.shtml). The APH collection currently contains over 500 images, many of which are among the nation's most valuable historical documents. On the usa.gov site, users can explore this collection in only two ways: either by era, such as 18th century and 19th century, or by special topics, such as "presidents" (Figure 2). There are only four special topics, manually maintained by the collection administrator, and they do not cover most items in the collection. This collection is poor in metadata and tools, which is common to many digital humanities collections that contain legacy items with little pre-existing metadata, or that lack resources for maintenance.
Figure 2. American Political History Collection at usa.gov
The prototype focused on the collaborative classification interface. After deploying our prototype, the collection has been collaboratively classified into categories along several facets. To prove the openness of the system architecture, the prototype has been integrated with different existing systems (Figure 3): a Faceted Classification button on the bottom of the screen (the button to the right links to a social tagging system, del.icio.us), the system integrated with a Flamenco-like interface, and the system integrated with Joomla!, a popular content management system.
Figure 3. Multi-facet Browsing
As users explore the system (such as by exploring faceted categories or through a keyword search), besides each item there is a "classify" button which leads to the classification interface. The classification interface shows the currently assigned categories in various facets for the selected item. It allows the user to drag and drop an item into a new category. At this level the user can also add or remove categories from a facet, or add or remove a facet.
Figure 4. Classification Interface: users can create/edit facets and categories, and drag items into categories.
Evaluation and Future Steps
Initial evaluation results in a controlled environment show great promise. The prototype was tested by university students interested in American political history. The collection was collaboratively categorized into facets such as Artifact (map, photo, etc.), Location, Year, and Topics (Buildings, Presidents, etc.). The prototype is found to be more effective than the original website in supporting users' retrieval tasks, in terms of both recall and precision. At this time, our prototype does not have all the necessary support to be deployed on the public Internet for a large number of users. For this we need to work on the concept of hardening a newly added category or facet. The key idea behind hardening is to accept a new category or facet only after reinforcement from multiple users. In the absence of hardening support our system will be overwhelmed by the number of new facets and categories. We are also exploring automated document classification and facet schema enrichment techniques. We believe that collaborative faceted classification can improve access to many digital humanities collections.
Acknowledgements
This project is supported in part by the United States National Science Foundation, Award No. 0713290.
References
[1] Hearst, M.A., Clustering versus Faceted Categories for Information Exploration. Communications of the ACM, 2006, 49(4).
[2] Quintarelli, E., Rosati, L., and Resmini, A. Facetag: Integrating Bottom-up and Top-down Classification in a Social Tagging System. EuroIA 2006, Berlin.
[3] Wu, H. and Gordon, M.D., Collaborative filing in a document repository. SIGIR 2004: pp. 518-519.
[4] Heymann, P. and Garcia-Molina, H., Collaborative Creation of Communal Hierarchical Taxonomies in Social Tagging Systems. Stanford Technical Report InfoLab 2006-10, 2006.
[5] Joachims, T. Text categorization with support vector machines. In Proceedings of the 10th European Conference on Machine Learning, pages 137-142, April 1998.
[6] Sanderson, M. and Croft, B., Deriving concept hierarchies from text. SIGIR 1999: pp. 206-213.
[7] Schmitz, P., Inducing Ontology from Flickr Tags. Workshop in Collaborative Web Tagging, 2006.
[8] WordNet: An Electronic Lexical Database. Christiane Fellbaum (editor). 1998. The MIT Press, Cambridge, MA.
Annotated Facsimile Editions: Defining Macro-level Structure for Image-Based Electronic Editions
Neal Audenaert neal.audenaert@gmail.com Texas A&M University, USA
Richard Furuta furuta@cs.tamu.edu Texas A&M University, USA
Introduction
Facsimile images form a major component in many digital editing projects. Well-known projects such as the Blake Archive [Eaves 2007] and the Rossetti Archive [McGann 2007] use facsimile images as the primary entry point to accessing the visually rich texts in their collections. Even for projects focused on transcribed electronic editions, it is now standard practice to include high-resolution facsimiles. Encoding standards and text processing toolkits have been the focus of significant research. Tools, standards, and formal models for encoding information in image-based editions have only recently begun to receive attention. Most work in this area has centered on the digitization and presentation of visual materials [Viscomi 2002] or detailed markup and encoding of information within a single image [Lecolinet 2002, Kiernan 2004, Dekhtyar 2006]. Comparatively little work has been done on modeling the large-scale structure of facsimile editions. Typically, the reading interface that presents a facsimile determines its structure. Separating the software used to model data from that used to build user interfaces has well-known advantages for both engineering and digital humanities practices. To achieve this separation, it is necessary to develop a model of a facsimile edition that is independent of the interface used to present that edition. In this paper, we present a unified approach for representing the linguistic, structural, and graphical content of a text as an Annotated Facsimile Edition (AFED). This model grows out of our experience with several digital facsimile edition projects over more than a decade, including the Cervantes Project [Furuta 2001], the Digital Donne [Monroy 2007a], and the Nautical Archaeology Digital Library [Monroy 2007b]. Our work on these projects has emphasized the need for an intuitive conceptual model of a digital facsimile. This model can then serve as the basis for a core software module that can be used across projects without requiring extensive modification by software developers. Drawing on our prior work we have distilled five primary goals for such a model:
• Openness: Scholars' focused research needs are highly specific, vary widely between disciplines, and change over time. The model must accommodate new information needs as they arise.
• Non-hierarchical: Facsimile editions contain some information that should be presented hierarchically, but they cannot be adequately represented as a single, properly nested hierarchy.
• Restructuring: A facsimile is a representation of the physical form of a document, but the model should enable applications to restructure the original form to meet specific needs.
• Alignment: Comparison between varying representations of the same work is a fundamental task of humanities research. The model must support alignment between facsimiles of different copies of a work.
Annotated Facsimile Editions
The Annotated Facsimile Edition (AFED) models the macro-level structure of facsimile editions, representing them as a stream of images with annotations over that stream. Figure 1 shows a simplified diagram illustrating a two-volume edition of collected poems. Annotations encode the structure of the document and properties of the structural elements they represent. Separate annotation streams encode multiple analytical perspectives. For example, in Figure 1, the annotations shown below the image stream describe the physical structure of the edition (volumes, pages, and lines) while the annotations shown above the image stream describe the poetic structure (poems, titles, epigraphs, stanzas). Annotations within a single analytical perspective – but not those from different perspectives – follow a hierarchical structure.
Figure 1: A simplified diagram showing an edition of collected poems (in two volumes) represented as an annotated facsimile edition.
Many historical texts exist only as fragments. Many more have suffered damage that results in the loss of a portion of the original text. Despite this damage, the general content and characteristics of the text may be known or hypothesized based on other sources. In other cases, while the original artifact may exist, a digital representation of all or part of the artifact may be unavailable initially. To enable scholars to work with missing or unavailable portions of a facsimile, we introduce the notion of an abstract image. An abstract image is simply a placeholder for a known or hypothesized artifact of the text for which no image is available. Annotations attach to abstract images in the same way they attach to existing images.
The Image Stream
The image stream intuitively corresponds to the sequential ordering of page images in a traditional printed book. These images, however, need not represent actual "pages." An image might show a variety of artifacts including an opening of a book, a fragment of a scroll, or an unbound leaf of manuscript notes. While it is natural to treat facsimile images sequentially, any particular linear sequence represents an implementation decision – a decision that may not be implied by the physical document. For example, an editor may choose to arrange an edition of letters according to the date written, recipient, or thematic content. The image stream, therefore, is an implementation detail of the model. The structure of the edition is specified explicitly by the annotations.
Annotations
Annotations are the primary means for representing structural and linguistic content in the AFED. An annotation identifies a range of images and specifies properties about those images. Table 1 lists the information specified by each annotation. Properties in italics are optional.
Annotation Management
Perspective – Analytical perspective, e.g., physical structure, narrative elements, poetic.
Type – The name of this type of annotation, e.g., page, volume, chapter, poem, stanza.
Start Index – The index into the image stream where this annotation starts.
Stop Index – The index into the image stream where this annotation ends.
Sequence – A number for resolving the sequence of multiple annotations on the same page.
Content
Canonical Name – A canonical name that uniquely identifies this content relative to a domain specific classification scheme.
Display Name – The name to be displayed when referring to an instance of this annotation.
Properties – A set of key/value pairs providing domain specific information about the annotation.
Transcriptions – A set of transcriptions of the content that this annotation specifies.
Structural Information
Parent – A reference to the parent of this annotation.
Children – A list of references to the children of this annotation.
Table 1: Information represented by an annotation.
As shown in this table, annotations support three main categories of information: annotation management, content, and structural information. The annotation management and structural information categories contain record-keeping information. Structural information describes the hierarchical structure of annotations within an analytical perspective. The annotation management category specifies the annotation type and identifies the image content referenced by the annotation. The sequence number is an identifier used by AFED to determine the relative ordering of multiple annotations that have the same starting index. AFED is agnostic to the precise semantics of this value. The annotation type determines these semantics. For example, a paragraph annotation may refer to the paragraph number relative to a page, chapter, or other structural unit.
The content category describes the item referenced by the annotation. Annotations support two naming conventions. To facilitate comparison between documents, an annotation may specify a canonical name according to a domain specific naming convention. Canonical names usually do not match the name given to the referenced item by the artifact itself and are rarely appropriate for display to a general audience. Accordingly, the annotation requires the specification of a name suitable for display. Descriptive metadata can be specified as a set of key/value properties. In addition to descriptive metadata, annotations support multiple transcriptions. Multiple transcriptions allow alternate perspectives of the text; for example, a paleographic transcription to support detailed linguistic analysis and a normalized transcription to facilitate reading. Transcriptions may also include translations.
AFED's annotation mechanism defines a high-level syntactical structure that is sufficient to support the basic navigational needs of most facsimile projects. By remaining agnostic to semantic details, it allows for flexible, project-specific customization. Where projects need to support user interactions that go beyond typical navigation scenarios, these interactions can be integrated into the user interface without requiring changes to the lower-level tools used to access the facsimile.
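Read as a data structure, Table 1 maps naturally onto a small set of record types. The following sketch is one possible reading of that table in Python; the class and field names are assumptions drawn from the table, not the authors' actual implementation.

```python
# Illustrative sketch of the AFED structures implied by Table 1.
# Field and class names are assumptions, not the authors' code.
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Annotation:
    # Annotation management
    perspective: str                 # e.g. "physical structure", "poetic"
    type: str                        # e.g. "page", "volume", "poem", "stanza"
    start_index: int                 # first position in the image stream
    stop_index: int                  # last position in the image stream
    sequence: int = 0                # orders annotations sharing a start index
    # Content
    display_name: str = ""
    canonical_name: Optional[str] = None
    properties: dict[str, str] = field(default_factory=dict)
    transcriptions: dict[str, str] = field(default_factory=dict)  # e.g. {"paleographic": "...", "normalized": "..."}
    # Structural information (hierarchy within one perspective)
    parent: Optional["Annotation"] = None
    children: list["Annotation"] = field(default_factory=list)

@dataclass
class FacsimileEdition:
    images: list[Optional[str]]      # None stands in for an "abstract image"
    annotations: list[Annotation] = field(default_factory=list)

    def in_perspective(self, perspective: str) -> list[Annotation]:
        """Return the annotation stream for one analytical perspective."""
        return [a for a in self.annotations if a.perspective == perspective]
```

The point of the sketch is simply that each analytical perspective can be filtered out of the same image stream, which is what keeps the model independent of any particular reading interface.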
Discussion
AFED has proven to be a useful model in our work. We have deployed a proof of concept prototype based on the AFED model. Several of the facsimile editions constructed by the Cervantes Project use this prototype behind the scenes. Given its success in these reader's interfaces, we are working to develop a Web-based editing toolkit. This application will allow editors to quickly define annotations and use those annotations to describe a facsimile edition. We anticipate completing this tool by the summer of 2008.
By using multiple, hierarchical annotation streams, AFED's expressive power falls under the well-studied class of document models known as OHCO (ordered hierarchy of content objects).
Specifically, it is an instance of a revised form of this generic model known as OHCO-3, [Renear 1996]. Whereas most prior research and development associated with the OHCO model has focused on XML-based, transcribed content, we have applied this model to the task of representing macro-level structures in facsimile editions. Focusing on macro-level document structure partially isolates the AFED model from the non-hierarchical nature of documents both in terms of the complexity of the required data structures, and in terms of providing simplified model to facilitate system implementation. If warranted by future applications, we can relax AFED’s hierarchical constraint. Relaxing this constraint poses no problems with the current prototype; however, further investigation is needed to determine potential benefits and drawbacks. In addition to macro-level structures, a document model that strives to represent the visual content of a document for scholarly purposes must also account for fine-grained structures present in individual images and provide support for encoded content at a higher level of detail. We envision using the AFED model in conjunction with models tailored for these low-level structures.We are working to develop a model for representing fine-grained structure in visually complex documents grounded in spatial hypermedia theory. Acknowledgements This material is based upon work support by National Science Foundation under Grant No. IIS-0534314. References [Dekhtyar 2006] Dekhtyar, A., et al. Support for XML markup of image-based electronic editions. International Journal on Digital Libraries 6(1) 2006, pp. 55-69. [Eaves 2007] Eaves, M., Essick, R.N.,Viscomi, J., eds. The William Blake Archive. http://www.blakearchive.org/ [24 November 2007] [Furuta 2001] Furuta, R., et al. The Cervantes Project: Steps to a Customizable and Interlinked On-Line Electronic Variorum Edition Supporting Scholarship. In Proceedings of ECDL 2001, LNCS, 2163. Springer-Verlag: Heidelberg, pp. 7182. [Kiernan 2004] Kiernan K., et al. The ARCHway Project: Architecture for Research in Computing for Humanities through Research, Teaching, and Learning. Literary and Linguistic Computing 2005 20(Suppl 1):69-88. [Lecolinet 2002] Lecolinet, E., Robert, L. and Role. F. Textimage Coupling for Editing Literary Sources. Computers and the Humanities 36(1): 2002 pp 49-73. _____________________________________________________________________________ 49 Digital Humanities 2008 _____________________________________________________________________________ [McGann 2007] McGann, J., The Complete Writings and Pictures of Dante Gabriel Rossetti. Institute for Advanced Technology in the Humanities, University of Virginia. http://www. rossettiarchive.org/ [24 November 2007] [Monroy 2007a] Monroy, C., Furuta, R., Stringer, G. Digital Donne: Workflow, Editing Tools and the Reader’s Interface of a Collection of 17th-century English Poetry. In Proceedings of Joint Conference on Digital Libraries JCDL 2007 (Vancouver, BC, June 2007), ACM Press: New York, NY, pp. 411-412. [Monroy 2007b] Monroy, C., Furuta, R., Castro, F. Texts, Illustrations, and Physical Objects: The Case of Ancient Shipbuilding Treatises. In Proceedings of ECDL 2007, LNCS, 4675. Springer-Verlag: Heidelberg, pp. 198-209. [Renear 1996] Renear, A., Mylonas, E., Durand, D. Refining our Notion of What Text Really Is: The Problem of Overlapping Hierarchies. In Ide, N., Hockey, S. Research in Humanities Computing. Oxford: Oxford University Press, 1996. 
[Viscomi 2002] Viscomi, J. (2002). ‘Digital Facsimiles: Reading the William Blake Archive’. Kirschenbaum, M. (ed.) Image-based Humanities Computing, spec. issue of Computers and the Humanities, 36(1): 27-48.
CritSpace: Using Spatial Hypertext to Model Visually Complex Documents
Neal Audenaert neal.audenaert@gmail.com Texas A&M University, USA
George Lucchese george_lucchese@tamu.edu Texas A&M University, USA
Grant Sherrick sherrick@csdl.tamu.edu Texas A&M University, USA
Richard Furuta furuta@cs.tamu.edu Texas A&M University, USA
In this paper, we present a Web-based interface for editing visually complex documents, such as modern authorial manuscripts. Applying spatial hypertext theory as the basis for designing this interface enables us to facilitate both interaction with the visually complex structure of these documents and integration of heterogeneous sources of external information. This represents a new paradigm for designing systems to support digital textual studies. Our approach emphasizes the visual nature of texts and provides powerful tools to support interpretation and creativity. In contrast to purely image-based systems, we are able to do this while retaining the benefits of traditional textual analysis tools.
Introduction
Documents express information as a combination of written words, graphical elements, and the arrangement of these content objects in a particular medium. Digital representations of documents—and the applications built around them—typically divide information between primarily textual representations on the one hand (e.g., XML-encoded documents) and primarily graphical representations on the other (e.g., facsimiles). Image-based representations allow readers access to high-quality facsimiles of the original document, but provide little support for explicitly encoded knowledge about the document. XML-based representations, by contrast, are able to specify detailed semantic knowledge embodied by encoding guidelines such as the TEI [Sperberg-McQueen 2003]. This knowledge forms the basis for building sophisticated analysis tools and developing rich hypertext interfaces to large document collections and supporting materials. This added power comes at a price. These approaches are limited by the need to specify all relevant content explicitly. This is, at best, a time-consuming and expensive task and, at worst, an impossible one [Robinson 2000]. Furthermore, in typical systems, access to these texts is mediated almost exclusively by the transcribed linguistic content, even when images are presented alongside their transcriptions.
By adopting spatial hypertext as a metaphor for representing document structure, we are able to design a system that emphasizes the visually construed contents of a document while retaining access to structured semantic information embodied in XML-based representations. Dominant hypertext systems, such as the Web, express document relationships via explicit links. In contrast, spatial hypertext expresses relationships by placing related content nodes near each other on a two-dimensional canvas [Marshall 1993]. In addition to spatial proximity, spatial hypertext systems express relationships through visual properties such as background color, border color and style, shape, and font style.
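To make the spatial hypertext idea concrete, a content node can be thought of as a small record combining position, visual properties, and optional formal metadata. The following sketch is a hypothetical illustration, not CritSpace's actual data model; all names are invented.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class ContentNode:
    """A node on the two-dimensional canvas of a spatial hypertext workspace."""
    source_uri: str                     # facsimile image, crop, annotation, or external surrogate
    x: float                            # position on the canvas; proximity implies relatedness
    y: float
    width: float
    height: float
    background_color: str = "#ffffff"   # visual properties also carry meaning
    border_color: str = "#000000"
    border_style: str = "solid"
    # Formal descriptions can be added incrementally ("incremental formalism"):
    transcription: Optional[str] = None
    metadata: Dict[str, str] = field(default_factory=dict)
```

Relationships are then left implicit in the layout (nearby nodes, matching colours) until a user, or a structure-recognition parser, chooses to formalize them.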
Common features of spatial hypermedia systems include parsers capable of recognizing relationships between objects such as lists, list headings, and stacks, structured metadata attributes for objects, search capability, navigational linking, and the ability to follow the evolution of the information space via a history mechanism. The spatial hypertext model has an intuitive appeal for representing visually complex documents. According to this model, documents specify relationships between content objects (visual representations of words and graphical elements) based on their spatial proximity and visual similarity.This allows expression of informal, implicit, and ambiguous relationships— a key requirement for humanities scholarship. Unlike purely image-based representations, spatial hypertext enables users to add formal descriptions of content objects and document structure incrementally in the form of structured metadata (including transcriptions and markup). Hypermedia theorists refer to this process as “incremental formalism” [Shipman 1999]. Once added, these formal descriptions facilitate system support for text analysis and navigational hypertext. Another key advantage of spatial hypertext is its ability to support “information triage” [Marshall 1997]. Information triage is the process of locating, organizing, and prioritizing large amounts of heterogeneous information. This is particularly helpful in supporting information analysis and decision making in situations where the volume of information available makes detailed evaluation of it each resource impossible. By allowing users to rearrange objects freely in a two-dimensional workspace, spatial hypertext systems provide a lightweight interface for organizing large amounts of information. In addition to content taken directly from document images, this model encourages the inclusion of visual surrogates for information drawn from numerous sources. These include related photographs and artwork, editorial annotations, links to related documents, and bibliographical references. Editors/ readers can then arrange this related material as they interact with the document to refine their interpretive perspective. Editors/readers are also able to supply their own notes and graphical annotations to enrich the workspace further. System Design We are developing CritSpace as a proof of concept system using the spatial hypertext metaphor as a basis for supporting digital textual studies. Building this system from scratch, rather than using an existing application, allows us to tailor the design to meet needs specific to the textual studies domain (for example, by including document image analysis tools). We are also able to develop this application with a Web-based interface tightly integrated with a digital archive containing a large volume of supporting information (such as artwork, biographical information, and bibliographic references) as well as the primary documents. Initially, our focus is on a collection of manuscript documents written by Picasso [Audenaert 2007]. Picasso’s prominent use of visual elements, their tight integration with the linguistic content, and his reliance on ambiguity and spatial arrangement to convey meaning make this collection particularly attractive [Marin 1993, Michaël 2002].These features also make his work particularly difficult to represent using XML-based approaches. 
More importantly, Picasso’s writings contain numerous features that exemplify the broader category of modern manuscripts including documents in multiple states, extensive authorial revisions, editorial drafts, interlinear and marginal scholia, and sketches and doodles. Figure 1: Screenshot of CritSpace showing a document along with several related preparatory sketches Picasso made for Guernica about the same time. CritSpace provides an HTML based interface for accessing the collection maintained by the Picasso Project [Mallen 2007] that contains nearly 14,000 artworks (including documents) and 9,500 biographical entries. CritSpace allows users to arrange facsimile document images in a two dimensional workspace and resize and crop these images. Users may also search and browse the entire holdings of the digital library directly from CritSpace, adding related artworks and biographical information to the workspace as desired. In addition to content taken from the digital library, users may add links to other _____________________________________________________________________________ 51 Digital Humanities 2008 _____________________________________________________________________________ material available on the Web, or add their own comments to the workspace in the form of annotations. All of these items are displayed as content nodes that can be freely positioned and whose visual properties can be modified. Figure 1 shows a screenshot of this application that displays a document and several related artworks. CritSpace also introduces several features tailored to support digital textual studies. A tab at the bottom of the display opens a window containing a transcription of the currently selected item. An accordion-style menu on the right hand side provides a clipboard for temporarily storing content while rearranging the workspace, an area for working with groups of images, and a panel for displaying metadata and browsing the collection based on this metadata. We also introduce a full document mode that allows users to view a high-resolution facsimile. This interface allows users to add annotations (both shapes and text) to the image and provides a zooming interface to facilitate close examination of details. Future Work CritSpace provides a solid foundation for understanding how to apply spatial hypertext as a metaphor for interacting with visually complex documents.This perspective opens numerous directions for further research. A key challenge is developing tools to help identify content objects within a document and then to extracted these object in a way that will allow users to manipulate them in the visual workspace. Along these lines, we are working to adapt existing techniques for foreground/background segmentation [Gatos 2004], word and line identification [Manmatha 2005], and page segmentation [Shafait 2006]. We are investigating the use of Markov chains to align transcriptions to images semiautomatically [Rothfeder 2006] and expectation maximization to automatically recognize dominant colors for the purpose of separating information layers (for example, corrections made in red ink). Current implementations of these tools require extensive parameter tuning by individuals with a detailed understanding of the image processing algorithms. We plan to investigate interfaces that will allow non-experts to perform this parameter tuning interactively. Modern spatial hypertext applications include a representation of the history of the workspace [Shipman 2001]. 
We are interested in incorporating this notion in order to represent documents not as fixed and final objects, but as objects that have changed over time. This history mechanism will enable editors to reconstruct hypothetical changes to the document as authors and annotators have modified it. It can also be used to allow readers to see the changes made by an editor while constructing a particular form of the document. While spatial hypertext provides a powerful model for representing a single workspace, textual scholars will need tools to support the higher-level structure found in documents, such as chapters, sections, books, and volumes. Further work is needed to identify ways in which existing spatial hypertext models can be extended to express relationships between these structures and to support navigation, visualization, and editing.
Discussion
Spatial hypertext offers an alternative to the dominant view of text as an “ordered hierarchy of content objects” (OHCO) [DeRose 1990]. The OHCO model emphasizes the linear, linguistic content of a document and requires explicit formalization of structural and semantic relationships early in the encoding process. For documents characterized by visually constructed information or complex and ambiguous structures, OHCO may be overly restrictive. In these cases, the ability to represent content objects graphically in a two-dimensional space gives scholars the flexibility to represent both the visual aspects of the text they are studying and the ambiguous, multi-faceted relationships found in those texts. Furthermore, by including an incremental path toward the explicit encoding of document content, this model enables the incorporation of advanced textual analysis tools that can leverage both the formally specified structure and the spatial arrangement of the content objects.
Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant No. IIS-0534314.
References
[Audenaert 2007] Audenaert, N. et al. Viewing Texts: An Art-Centered Representation of Picasso’s Writings. In Proceedings of Digital Humanities 2007 (Urbana-Champaign, IL, June 2007), pp. 14-17.
[DeRose 1990] DeRose, S., Durand, D., Mylonas, E., Renear, A. What is Text Really? Journal of Computing in Higher Education 1(2), pp. 3-26.
[Gatos 2004] Gatos, B., Pratikakis, I., Perantonis, S. J. An Adaptive Binarization Technique for Low Quality Historical Documents. In Proceedings of Document Analysis Systems 2004. LNCS 3163. Springer-Verlag: Berlin, pp. 102-113.
[Mallen 2007] Mallen, E., ed. (2007) The Picasso Project. Texas A&M University. http://picasso.tamu.edu/ [25 November 2007]
[Manmatha 2005] Manmatha, R., Rothfeder, J. L. A Scale Space Approach for Automatically Segmenting Words from Historical Handwritten Documents. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(8), pp. 1212-1225.
[Marin 1993] Marin, L. Picasso: Image Writing in Process. Trans. by Sims, G. In October 65 (Summer 1993), MIT Press: Cambridge, MA, pp. 89-105.
[Marshall 1993] Marshall, C. and Shipman, F. Searching for the Missing Link: Discovering Implicit Structure in Spatial Hypertext. In Proceedings of Hypertext ’93 (Seattle, WA, Nov. 1993), ACM Press: New York, NY, pp. 217-230.
[Marshall 1997] Marshall, C. and Shipman, F. Spatial hypertext and the practice of information triage.
In Proceedings of Hypertext ’97 (Southampton, UK, Nov. 1997), ACM Press: New York, NY, pp. 124-133.
[Michaël 2002] Michaël, A. Inside Picasso’s Writing Laboratory. Presented at Picasso: The Object of the Myth, November 2002. http://www.picasso.fr/anglais/cdjournal.htm [December 2006]
[Renear 1996] Renear, A., Mylonas, E., Durand, D. Refining our Notion of What Text Really Is: The Problem of Overlapping Hierarchies. In Ide, N., Hockey, S. Research in Humanities Computing. Oxford: Oxford University Press, 1996.
[Robinson 2000] Robinson, P. Ma(r)king the Electronic Text: How, Why, and for Whom? In Joe Bray et al. Ma(r)king the Text: The Presentation of Meaning on the Literary Page. Ashgate: Aldershot, England, pp. 309-28.
[Rothfeder 2006] Rothfeder, J., Manmatha, R., Rath, T. M. Aligning Transcripts to Automatically Segmented Handwritten Manuscripts. In Proceedings of Document Analysis Systems 2006. LNCS 3872. Springer-Verlag: Berlin, pp. 84-95.
[Shafait 2006] Shafait, F., Keysers, D., Breuel, T. Performance Comparison of Six Algorithms for Page Segmentation. In Proceedings of Document Analysis Systems 2006. LNCS 3872. Springer-Verlag: Berlin, pp. 368-379.
[Shipman 1999] Shipman, F. and McCall, R. Supporting Incremental Formalization with the Hyper-Object Substrate. ACM Transactions on Information Systems 17(2), ACM Press: New York, NY, pp. 199-227.
[Shipman 2001] Shipman, F., Hsieh, H., Maloor, P. and Moore, M. The Visual Knowledge Builder: A Second Generation Spatial Hypertext. In Proceedings of the Twelfth ACM Conference on Hypertext and Hypermedia (Aarhus, Denmark, August 2001), ACM Press, pp. 113-122.
[Sperberg-McQueen 2003] Sperberg-McQueen, C. and Burnard, L. Guidelines for Electronic Text Encoding and Interchange: Volumes 1 and 2: P4. University Press of Virginia, 2003.
Glimpses through the clouds: collocates in a new light
David Beavan d.beavan@englang.arts.gla.ac.uk University of Glasgow, UK
This paper demonstrates a web-based, interactive data visualisation, allowing users to quickly inspect and browse the collocational relationships present in a corpus. The software is inspired by tag clouds, first popularised by the on-line photograph sharing website Flickr (www.flickr.com). A paper based on a prototype of this Collocate Cloud visualisation was given at Digital Resources for the Humanities and Arts 2007. The software has since matured, offering new ways of navigating and inspecting the source data. It has also been expanded to analyse additional corpora, such as the British National Corpus (http://www.natcorp.ox.ac.uk/), which will be the focus of this talk.
Tag clouds allow the user to browse, rather than search for specific pieces of information. Flickr encourages its users to add tags (keywords) to each photograph uploaded. The tags associated with each individual photograph are aggregated; the most frequent go on to make the cloud. The cloud consists of these tags presented in alphabetical order, with their frequency displayed as variation in colour, or more commonly font size. Figure 1 is an example of the most popular tags at Flickr:
Figure 1. Flickr tag cloud showing 125 of the most popular photograph keywords http://www.flickr.com/photos/tags/ (accessed 23 November 2007)
The cloud offers two ways to access the information. If the user is looking for a specific term, the alphabetical ordering of the information allows it to be quickly located if present. More importantly, as a tool for browsing, frequent tags stand out visually, giving the user an immediate overview of the data.
Clicking on a tag name will display all photographs which contain that tag.
The cloud-based visualisation has been successfully applied to language. McMaster University’s TAPoR Tools (http://taporware.mcmaster.ca/) features a ‘Word Cloud’ module, currently in beta testing. WMatrix (http://ucrel.lancs.ac.uk/wmatrix/) can compare two corpora by showing log-likelihood results in cloud form. In addition to other linguistic metrics, the internet book seller Amazon provides a word cloud, see figure 2.
Figure 2. Amazon.com’s ‘Concordance’ displaying the 100 most frequent words in Romeo and Juliet http://www.amazon.com/Romeo-Juliet-Folger-ShakespeareLibrary/dp/sitb-next/0743477111/ref=sbx_con/1044970220-2133519?ie=UTF8&qid=1179135939&sr=11#concordance (accessed 23 November 2007)
In this instance a word frequency list is the data source, showing the most frequent 100 words. As with the tag cloud, this list is alphabetically ordered, the font size being proportionate to its frequency of usage. It has all the benefits of a tag cloud; in this instance clicking on a word will produce a concordance of that term.
The word clouds produced by TAPoR Tools, WMatrix and Amazon are, for browsing, an improvement over tabular statistical information. There is an opportunity for other corpus data to be enhanced by using a cloud. Linguists often use collocational information as a tool to examine language use. Figure 3 demonstrates a typical corpus tool output:
Figure 3. British National Corpus through the interface developed by Mark Davies, searching for ‘bank’, showing collocates http://corpus.byu.edu/bnc/ (accessed 23 November 2007)
The data contained in the table lends itself to visualisation as a cloud. As with the word cloud, the list of collocates can be displayed alphabetically. Co-occurrence frequency, like word frequency, can be mapped to font size. This would produce an output visually similar to the word cloud. Instead of showing all corpus words, they would be limited to those surrounding the chosen node word. Another valuable statistic obtainable via collocates is that of collocational strength, the likelihood of two words co-occurring, measured here by MI (Mutual Information). Accounting for this extra dimension requires an additional visual cue to be introduced, one which can convey the continuous data of an MI score. This can be solved by varying the colour, or brightness, of the collocates forming the cloud. The end result is shown in figure 4:
Figure 4. Demonstration of collocate cloud, showing node word ‘bank’
This method of visualisation and interaction offers another tool for corpus linguists. As developer for an online corpus project, I have found that the usability and sophistication of our tools have been important to our success. Cloud-like displays of information would complement our other advanced features, such as geographic mapping and transcription synchronisation.
The collocate cloud inherits all the advantages of previous cloud visualisations: a collocate, if known, can be quickly located due to the alphabetical nature of the display.
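The mapping from corpus counts to the cloud's two visual dimensions can be sketched in a few lines. The counts below are invented, and the MI formula shown is one common formulation of mutual information for window-based collocation; it is given only as an illustration, not as the exact computation used by the Collocate Cloud.

```python
import math

def mutual_information(node_freq, coll_freq, pair_freq, corpus_size, window=4):
    """A common MI formulation for window-based collocation (illustrative only)."""
    expected = (node_freq * coll_freq * window) / corpus_size
    return math.log2(pair_freq / expected)

def font_size(pair_freq, min_freq, max_freq, smallest=10, largest=48):
    """Map co-occurrence frequency to font size, as in a word cloud."""
    scale = (pair_freq - min_freq) / max(max_freq - min_freq, 1)
    return smallest + scale * (largest - smallest)

def brightness(mi, min_mi, max_mi):
    """Map collocational strength (MI) to a 0-255 brightness value."""
    scale = (mi - min_mi) / max(max_mi - min_mi, 1e-9)
    return round(255 * scale)

# Invented numbers: 'bank' occurs 20,000 times, 'river' 8,000 times,
# and they co-occur 600 times in a 100-million-word corpus.
mi = mutual_information(20_000, 8_000, 600, 100_000_000)
print(round(mi, 2), font_size(600, 1, 600), brightness(mi, 0.0, 10.0))
```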
Frequently occurring collocates stand out, as they are shown in a larger typeface, with collocationally strong pairings highlighted using brighter _____________________________________________________________________________ 54 Digital Humanities 2008 _____________________________________________________________________________ formatting. Therefore bright, large collocates are likely to be of interest, whereas dark, small collocates perhaps less so. Hovering the mouse over a collocate will display statistical information, co-occurrence frequency and MI score, as one would find from the tabular view. The use of collocational data also presents additional possibilities for interaction. A collocate can be clicked upon to produce a new cloud, with the previous collocate as the new node word. This gives endless possibilities for corpus exploration and the investigation of different domains. Occurrences of polysemy can be identified and expanded upon by following the different collocates. Particular instances of usage are traditionally hidden from the user when viewing aggregated data, such as the collocate cloud. The solution is to allow the user to examine the underlying data by producing an optional concordance for each node/collocate pairing present. Additionally a KWIC concordance can be generated by examining the node word, visualising the collocational strength of the surrounding words. These concordance lines can even by reordered on the basis of collocational strength, in addition to the more traditional options of preceding or succeeding words. This visualisation may be appealing to members of the public, or those seeking a more practical introduction to corpus linguistics. In teaching use they not only provide analysis, but from user feedback, also act as stimulation in creative writing. Collocate searches across different corpora or document sets may be visualised side by side, facilitating quick identification of differences. While the collocate cloud is not a substitute for raw data, it does provide a fast and convenient way to navigate language. The ability to generate new clouds from existing collocates extends this further. Both this iterative nature and the addition of collocational strength information gives these collocate clouds greater value for linguistic research than previous cloud visualisations. The function and accuracy of old Dutch urban designs and maps. A computer assisted analysis of the extension of Leiden (1611) Jakeline Benavides j.benavides@rug.nl University of Groningen,The Netherlands Charles van den Heuvel charles.vandenheuvel@vks.knaw.nl Royal Netherlands Academy of Arts and Sciences,The Netherlands Historical manuscripts and printed maps of the pre-cadastral1 period show enormous differences in scale, precision and color. Less apparent are differences in reliability between maps, or between different parts of the same map. True, modern techniques in computer assisted cartography make it possible to measure very accurately such differences between maps and geographic space with very distinct levels of precision and accuracy. However differences in reliability between maps, or between different parts of the same map, are not only due to the accuracy measurement techniques, but also to their original function and context of (re-)use. Historical information about the original context of function and context of (re-)use can give us insight how to measure accuracy, how to choose the right points for geo-referencing and how to rectify digital maps. 
On the other hand computer assisted cartography enables us to trace and to visualize important information about mapmaking, especially when further historical evidence is missing, is hidden or is distorted, consciously or unconsciously. The proposed paper is embedded in the project: Paper and Virtual Cities, (subsidized by the Netherlands Organization for Scientific Research) that aims at developing methodologies that a) permit researchers to use historical maps and related sources more accurately in creating in digital maps and virtual reconstructions of cities and b) allow users to recognize better technical manipulations and distortions of truth used in the process of mapmaking.2 In this paper we present as one of the outcomes of this project a method that visualizes different levels of accuracy in and between designs and maps in relation to their original function to assess their quality for re-use today.This method is presented by analyzing different 17th century designs, manuscript and engraved maps of the city of Leiden, in particular of the land surveyor, mapmaker Jan Pietersz. Dou. The choice for Leiden and Dou is no coincidence. One of the reasons behind differences the accuracy of maps is the enormous variety in methods and measures used in land surveying and mapmaking in the Low Countries.3 This variety was the result of differences in the private training and the _____________________________________________________________________________ 55 Digital Humanities 2008 _____________________________________________________________________________ backgrounds of the surveyors, in the use of local measures and in the exam procedures that differed from province to province.4 These differences would last until the 19th Century. However, already by the end of the 16th Century we see in and around Leiden the first signs of standardization in surveying techniques and the use of measures.5 First of all, the Rhineland rod (3.767 meter) the standard of the water-administration body around Leiden is used more and more, along local measures in the Low Countries and in Dutch expansion overseas. A second reason to look at Leiden in more detail is that in 1600 a practical training school in land surveying and fortification would be founded in the buildings of its university, the so-called Duytsche Mathematique, that turned out to be very successful not only in the Low Countries, but also in other European countries. This not only contributed to the spread and reception of the Rhineland rod, but also to the dissemination of more standardized ways of land surveying and fortification.6 The instructional material of the professors of this training school and the notes of their pupils are still preserved, which allows us to study the process of surveying and mapmaking in more detail. The reason to look into the work of Jan Pietersz. Dou is his enormous production of maps. Westra (1994) calculated that at least 1150 maps still exist.7 Of the object of our case study alone, the city of Leiden, Dou produced at least 120 maps between 1600 and 1635, ranging from property maps, designs for extensions of the city and studies for civil engineering works etc. We will focus on the maps that Dou made for the urban extension of the city of Leiden of 1611. Sometimes these (partial) maps were made for a specific purpose; in other cases Dou tried in comprehensive maps, combining property estimates and future designs, to tackle problems of illegal economic activities, pollution, housing and fortification. 
Since these measurements were taken in his official role of, sworn-in land surveyor we can assume that they were supposed to be accurate. This variety in designs and maps for the same area allows us to discuss accuracy in relation to function and (re)use of maps.We will also explain that the differences between designs and maps require different methods of geo-referencing and analysis. In particular, we will give attention to one design map of Dou for the northern part of the city of Leiden RAL PV 100206 (Regionaal Archief Leiden) to show how misinterpretation of features lead to unreliable or biased decisions when the historical context is not taken into account, even when we can consider the results, in terms of accuracy, satisfactory. Since Dou later also made maps for commercial purposes of the same northern extension of Leiden it is interesting to compare these maps. Conclusions are drawn addressing the question of whether Dou used the same measurements to produce a commercial map or that he settled for less accuracy given the different purpose of the later map compared to his designs and property maps.To answer this question, we use modern digital techniques of geo-processing8 to compare the old maps to modern cartographical resources and to the cadastral map of the 1800s in order to determine how accurate the various maps in question are. We do this, by using the American National Standard for Spatial Data Accuracy (NSSDA) to define accuracy at a 95% confidence level. By point-based analysis we link distributional errors to classified features in order to find a relationship between accuracy and map function.9 Notes (1) Cadastral mapping refers to the “mapping of property boundaries, particularly to record the limitation of title or for the assessment of taxation”. The term “cadastral” is also used for referring to surveys and resurveys of public lands (Neumann, J. Encyclopedic Dictionary of Cartography in 25 Languages 1.16, 1997). (2) The project Paper and Virtual Cities. New methodologies for the use of historical sources in computer-assisted urban cartography (2003-2008) is a collaboration between the department of Alfa-Informatics of the University Groningen and the Virtual Knowledge Studio of the Royal Netherlands Academy of Arts and Sciences subsidized by the Netherlands Organization for Scientific Research (NWO). http://www. virtualknowledgestudio.nl/projects/paper-virtualcities.php (3) Meskens, A, Wiskunde tussen Renaissance en Barok. Aspecten van wiskunde-beoefening te Antwerpen 15501620, [Publikaties SBA/MVC 41-43], [PhD, University of Antwerp],Antwerp, 1995. H.C. Pouls, “Landmeetkundige methoden en instrumenten voor 1800”, in Stad in kaart, Alphen aan den Rijn, pp. 13-28 Winter, P.J. van, Hoger beroepsonderwijs avant-la-lettre: Bemoeiingen met de vorming van landmeters en ingenieurs bij de Nederlandse universiteiten van de 17e en 18e eeuw, Amsterdam/Oxford/New York, 1988. (4) Muller, E., Zanvliet, K., eds., Admissies als landmeter in Nederland voor 1811: Bronnen voor de geschiedenis van de landmeetkunde en haar toepassing in administratie, architectuur, kartografie en vesting-en waterbouwkunde, Alphen aan den Rijn 1987 (5) Zandvliet, K., Mapping for Money. Maps, plans and topographic paintings and their role in Dutch overseas expansion during the 16th and 17th centuries. (PhD Rijksuniversiteit Leiden], Amsterdam, 1998 describes this development as part of a process of institutionalization of mapmaking, esp, pp. 75-81. 
(6) Taverne, E.R.M., In ‘t land van belofte: in de nieue stadt. Ideaal en werkelijkheid van de stadsuitleg in de Republiek 1580-1680, [PhD, University of Groningen]Maarssen, 1978. _____________________________________________________________________________ 56 Digital Humanities 2008 _____________________________________________________________________________ Heuvel, C. van den, ‘Le traité incomplet de l’Art Militaire et l’ínstruction pour une école des ingénieurs de Simon Stevin’, Simon Stevin (1548-1620) L’emergence de la nouvelle science, (tentoonstellingscatalogus/catalogue Koninklijk Bibliotheek Albert I, Brussel/Bibliothèque Royale Albert I, Bruxelles, 17-09-2004-30-10-2004) Brussels 2004, pp. 101-111. idem, “The training of noblemen in the arts and sciences in the Low Countries around 1600. Treatises and instructional materials” in Alessandro Farnese e le Fiandre/Alexander and the Low Countries (in print) (7) Frans Westra, Jan Pietersz. Dou (1573-1635). “Invloedrijk landmeter van Rijnland”, Caert-thresoor, 13e jaargang 1994, nr. 2, pp. 37-48 (8) During geo-referencing a mathematical algorithm is used for scaling, rotating and translating the old map to give modern coordinates to it and allow further comparisons to modern sources. This algorithm is defined by a kind of transformation we decide to use based on the selection of a number of control points (GCPS). These items are described in detail later in this paper. All this processing was done by using different kind of software for digital and geographical processing and statistics (PCI Geomatics, ARCGIS, Autocad, MS Excel, among others). (9) Jakeline Benavides and John Nerbonne. Approaching Quantitative Accuracy in Early Dutch City Maps. XXIII International cartographic Conference. ISBN 978-5-99012031-0 (CD-ROM). Moscow, 2007 AAC-FACKEL and BRENNER ONLINE. New Digital Editions of Two Literary Journals Hanno Biber hanno.biber@oeaw.ac.at Austrian Academy of Sciences, Austria Evelyn Breiteneder evelyn.breiteneder@oeaw.ac.at Austrian Academy of Sciences, Austria Karlheinz Mörth karlheinz.moerth@oeaw.ac.at Austrian Academy of Sciences, Austria In this paper two highly innovative digital editions will be presented.The digital editions of the historical literary journals “Die Fackel” (published by Karl Kraus in Vienna from 1899 to 1936) and “Der Brenner” (published by Ludwig Ficker in Innsbruck from 1910 to 1954) have been developed within the corpus research framework of the “AAC - Austrian Academy Corpus” at the Austrian Academy of Sciences in collaboration with other researchers and programmers in the AAC from Vienna together with the graphic designer Anne Burdick from Los Angeles. For the creation of these scholarly digital editions the AAC edition philosophy and principles have been made use of whereby new corpus research methods have been applied for questions of computational philology and textual studies in a digital environment. The examples of these online editions will give insights into the potentials and the benefits of making corpus research methods and techniques available for scholarly research into language and literature. Introduction The “AAC - Austrian Academy Corpus” is a corpus research unit at the Austrian Academy of Sciences concerned with establishing and exploring large electronic text corpora and with conducting scholarly research in the field of corpora and digital text collections and editions. 
The texts integrated into the AAC are predominantly German language texts of historical and cultural significance from the last 150 years. The AAC has collected thousands of texts from various authors, representing many different text types from all over the German speaking world. Among the sources, which systematically cover various domains, genres and types, are newspapers, literary journals, novels, dramas, poems, advertisements, essays on various subjects, travel literature, cookbooks, pamphlets, political speeches as well as a variety of scientific, legal, and religious texts, to name just a few forms. The AAC provides resources for investigations into the linguistic and textual properties of these texts and into their historical and cultural qualities. More than 350 million running words of text have been scanned, _____________________________________________________________________________ 57 Digital Humanities 2008 _____________________________________________________________________________ digitized, integrated and annotated. The selection of texts is made according to the AAC’s principles of text selection that are determined by specific research interests as well as by systematic historical, empirical and typological parameters. The annotation schemes of the AAC, based upon XML related standards, have in the phase of corpus build-up been concerned with the application of basic structural mark-up and selective thematic annotations. In the phase of application development specific thematic annotations are being made exploring questions of linguistic and textual scholarly research as well as experimental and exploratory mark-up. Journals are regarded as interesting sources for corpus research because they comprise a great variety of text types over a long period of time. Therefore, model digital editions of literary journals have been developed: The AAC-FACKEL was published on 1 January 2007 and BRENNER ONLINE followed in October 2007. The basic elements and features of our approach of corpus research in the field of textual studies will be demonstrated in this paper. source for the history of the time, for its language and its moral transgressions. Karl Kraus covers in a typical and idiosyncratic style in thousands of texts the themes of journalism and war, of politics and corruption, of literature and lying. His influential journal comprises a great variety of essays, notes, commentaries, aphorisms and poems. The electronic text, also used for the compilation of a text-dictionary of idioms (“Wörterbuch der Redensarten zu der von Karl Kraus 1899 bis 1936 herausgegebenen Zeitschrift ‘Die Fackel’”), has been corrected and enriched by the AAC with information. The digital edition allows new ways of philological research and analysis and offers new perspectives for literary studies. BRENNER ONLINE AAC-FACKEL Figure 2. BRENNER ONLINE Interface Figure 1. AAC-FACKEL Interface The digital edition of the journal “Die Fackel” (“The Torch”), published and almost entirely written by the satirist and language critic Karl Kraus in Vienna from 1899 until 1936, offers free online access to 37 volumes, 415 issues, 922 numbers, comprising more than 22.500 pages and 6 million tokens. It contains a fully searchable database of the journal with various indexes, search tools and navigation aids in an innovative and functional graphic design interface, where all pages of the original are available as digital texts and as facsimile images. 
The work of Karl Kraus in its many forms, of which the journal is the core, can be regarded as one of the most important, contributions to world literature. It is a The literary journal “Der Brenner” was published between 1910 and 1954 in Innsbruck by Ludwig Ficker. The text of 18 volumes, 104 issues, which is a small segment of the AAC’s overall holdings, is 2 million tokens of corrected and annotated text, provided with additional information.“Die Fackel” had set an example for Ludwig Ficker and his own publication. Contrary to the more widely read satirical journal of Karl Kraus, the more quiet “Der Brenner” deals primarily with themes of religion, nature, literature and culture.The philosopher Ludwig Wittgenstein was affiliated with the group and participated in the debates launched by the journal. Among its contributors is the expressionist poet Georg Trakl, the writer Carl Dallago, the translator and cultural critic Theodor Haecker, translator of Søren Kierkegaard and Cardinal Newman into German, the moralist philosopher Ferdinand Ebner and many others. The journal covers a variety of subjects and is an important voice of Austrian culture in the pre and post second world war periods. The digital edition has been made by the AAC in collaboration with the Brenner-Archive of the University _____________________________________________________________________________ 58 Digital Humanities 2008 _____________________________________________________________________________ of Innsbruck. Both institutions have committed themselves to establish a valuable digital resource for the study of this literary journal. References Conclusion AAC - Austrian Academy Corpus: AAC-FACKEL, Online Version: Die Fackel. Herausgeber: Karl Kraus, Wien 18991936, AAC Digital Edition No 1, 2007, (http://www.aac. ac.at/fackel ) The philological and technological principles of digital editions within the AAC are determined by the conviction that the methods of corpus research will enable us to produce valuable resources for scholars. The AAC has developed such model editions to meet these aims. These editions provide well structured and well designed access to the sources. All pages are accessible as electronic text and as facsimile images of the original. Various indexes and search facilities are provided so that word forms, personal names, foreign text passages, illustrations and other resources can be searched for in various ways.The search mechanisms for personal names have been implemented in the interface for BRENNER ONLINE and will be done for “Die Fackel”. The interface is designed to be easily accessible also to less experienced users of corpora. Multi-word queries are possible. The search engine supports left and right truncation. The interface of the AAC-FACKEL provides also search mechanisms for linguistic searches, allows to perform lemma queries and offers experimental features. Instead of searching for particular word forms, queries for all the word forms of a particular lemma are possible. The same goes for the POS annotation. The web-sites of both editions are entirely based on XML and cognate technologies. On the character level use of Unicode has been made throughout. All of the content displayed in the digital edition is dynamically created from XML data. Output is produced through means of XSLT style sheets.This holds for the text section, the contents overview and the result lists. We have adopted this approach to ensure the viability of our data for as long a period as possible. 
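As a rough illustration of that pipeline (display output generated on demand from XML via XSLT), the fragment below shows the general idea in Python with lxml. The element names and the stylesheet are invented for the example; they are not the AAC's actual markup or stylesheets.

```python
from lxml import etree

# Invented stand-ins for an edition's XML data and its XSLT stylesheet.
xml_data = etree.XML(
    '<issue n="1" year="1899">'
    '<text><p>Erste Seite der Zeitschrift ...</p></text>'
    '</issue>'
)
stylesheet = etree.XML("""
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/issue">
    <div class="text-section">
      <h2>Issue <xsl:value-of select="@n"/> (<xsl:value-of select="@year"/>)</h2>
      <xsl:copy-of select="text/p"/>
    </div>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(stylesheet)    # compile the stylesheet once
html_fragment = transform(xml_data)   # generate the display fragment on request
print(etree.tostring(html_fragment, pretty_print=True).decode())
```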
Both digital editions have been optimized for use with recent browser versions. One of the basic requirements is that the browser should be able to handle XML-Dom and the local system should be furnished with a Unicode font capable of displaying the necessary characters. The interface has synchronized five individual frames within one single window, which can be alternatively expanded and closed as required.The “Paratext”-section situated within the first frame provides background information and essays. The “Index”section gives access to a variety of indexes, databases and full-text search mechanisms. The results are displayed in the adjacent section. The “Contents”-section has been developed, to show the reader the whole range of the journal ready to be explored and provides access to the whole run of issues in chronological order. The “Text”-section has a complex and powerful navigational bar so that the reader can easily navigate and read within the journals either in text-mode or in image-mode from page to page, from text to text, from issue to issue and with the help of hyperlinks. These digital editions will function as models for similar applications. The AAC’s scholarly editions of “Der Brenner” and “Die Fackel” will contribute to the development of digital resources for research into language and literature. AAC - Austrian Academy Corpus: http://www.aac.ac.at/fackel AAC - Austrian Academy Corpus and Brenner-Archiv: BRENNER ONLINE, Online Version: Der Brenner. Herausgeber: Ludwig Ficker, Innsbruck 1910-1954, AAC Digital Edition No 2, 2007, (http://www.aac.ac.at/brenner ) _____________________________________________________________________________ 59 Digital Humanities 2008 _____________________________________________________________________________ e-Science in the Arts and Humanities – A methodological perspective Tobias Blanke tobias.blanke@kcl.ac.uk King’s College London, UK Stuart Dunn stuart.dunn@kcl.ac.uk King’s College London, UK Lorna Hughes lorna.hughes@kcl.ac.uk King’s College London, UK Mark Hedges mark.hedges@kcl.ac.uk King’s College London, UK The aim of this paper is to provide an overview of e-Science and e-Research activities for the arts and humanities in the UK. It will focus on research projects and trends and will not cover the institutional infrastructure to support them. In particular, we shall explore the methodological discussions laid out in the Arts and Humanities e-Science Theme, jointly organised by the Arts and Humanities e-Science Support Centre and the e-Science Institute in Edinburgh (http://www.nesc.ac.uk/esi/ themes/theme_06/). The second focus of the paper will be the current and future activities within the Arts and Humanities e-Science Initiative in the UK and their methodological consequences (http://www.ahessc.ac.uk). The projects presented so far are all good indicators of what the future might deliver, as ‘grand challenges’ for the arts and humanities e-Science programme such as the emerging data deluge (Hey and Trefethen 2003). The Bush administration will have produced over 100 million emails by the end of its term (Unsworth 2006).These can provide the basis for new types of historical and socio-political research that will take advantage of computational methods to deal with digital data. However, for arts and humanities research an information is not just an information. Complicated semantics underlie the archives of human reports. 
As a simple example, it cannot be clear from the email alone which Bush administration or even which Iraq war are under consideration. Moreover, new retrieval methods for such data must be intuitive for the user and not based on complicated metadata schemes. They have to be specific in their return and deliver exactly that piece of information the researcher is interested in. This is fairly straightforward for structured information if it is correctly described, but highly complex for unstructured information. Arts and humanities additionally need the means to on-demand reconfigure the retrieval process by using computational power that changes the set of information items available from texts, images, movies, etc. This paper argues that a specific methodological agenda in arts and humanities e-Science has been developing over the past two years and explores some of its main tenets. We offer a chronological discussion of two phases in the methodological debates about the applicability of e-science and e-research to arts and humanities. The first phase concerns the methodological discussions that took place during the early activities of the Theme. A series of workshops and presentations about the role of eScience for arts and humanities purported to make existing e-science methodologies applicable to this new field and consider the challenges that might ensue (http://www.nesc. ac.uk/esi/themes/theme_06/community.htm). Several events brought together computer scientists and arts and humanities researchers. Further events (finished by the time of Digital Humanities 2008) will include training events for postgraduate students and architectural studies on building a national einfrastructure for the arts and humanities. Due to space limitations, we cannot cover all the early methodological discussions during the Theme here, but focus on two, which have been fruitful in uptake in arts and humanities: Access Grid and Ontological Engineering. A workshop discussed alternatives for video-conferencing in the arts and humanities in order to establish virtual research communities. The Access Grid has already proven to be of interest in arts and humanities research. This is not surprising, as researchers in these domains often need to collaborate with other researchers around the globe. Arts and humanities research often takes place in highly specialised domains and subdisciplines, niche subjects with expertise spread across universities.The Access Grid can provide a cheaper alternative to face-to-face meetings. However, online collaboration technologies like the Access Grid need to be better adapted to the specific needs of humanities researchers by e.g. including tools to collaboratively edit and annotate documents.The Access Grid might be a good substitute to some face-to-face meetings, but lacks innovative means of collaboration, which can be especially important in arts and humanities research. We should aim to realise real multicast interaction, as it has been done in VNC technology or basic wiki technology. These could support new models of collaboration in which the physical organisation of the Access Grid suite can be accommodated to specific needs that would e.g. allow participants to walk around. The procedure of Access Grid sessions could also be changed, away from static meetings towards more dynamic collaborations. 
Humanities scholars and performers have priorities and concerns that are often different from those of scientists and engineers (Nentwich 2003).With growing size of data resources the need arises to use recent methodological frameworks such as ontologies to increase the semantic interoperability of data. Building ontologies in the humanities is a challenge, which was the topic of the Theme workshop on ‘Ontologies and Semantic Interoperability for Humanities Data’.While semantic integration has been a hot topic in business and computing _____________________________________________________________________________ 60 Digital Humanities 2008 _____________________________________________________________________________ research, there are few existing examples for ontologies in the Humanities, and they are generally quite limited, lacking the richness that full-blown ontologies promise. The workshop clearly pointed at problems mapping highly contextual data as in the humanities to highly formalized conceptualization and specifications of domains. The activities within the UK’s arts and humanities e-Science community demonstrate the specific needs that have to be addressed to make e-Science work within these disciplines (Blanke and Dunn 2006). The early experimentation phase, which included the Theme events presented supra, delivered projects that were mostly trying out existing approaches in eScience. They demonstrated the need for a new methodology to meet the requirements of humanities data that is particularly fuzzy and inconsistent, as it is not automatically produced, but is the result of human effort. It is fragile and its presentation often difficult, as e.g. data in performing arts that only exists as an event. The second phase of arts and humanities e-Science began in September 2007 with seven 3-4 years projects that are moving away from ad hoc experimentation towards a more systematic investigation of methodologies and technologies that could provide answers to grand challenges in arts and humanities research. This second phase could be put in a nutshell as e-science methodology-led innovative research in arts and humanity. Next to performance, music research e.g. plays an important vanguard function at adopting e-Science methodologies, mostly because many music resources are already available in digital formats. At Goldsmiths, University of London, the project ‘Purcell Plus’ e.g. will build upon the successful collaboration ‘Online Musical Recognition and Searching (OMRAS)’ (http://www.omras.org/), which has just achieved a second phase of funding by the EPSRC. With OMRAS, it will be possible to efficiently search large-scale distributed digital music collections for related passages, etc. The project uses grid technologies to index the very large distributed music resources. ‘Purcell Plus’ will make use of the latest explosion in digital data for music research. It uses Purcell’s autograph MS of ‘Fantazies and In Nomines for instrumental ensemble’ and will investigate the methodology problems for using toolkits like OMRAS for musicology research. ‘Purcell Plus’ will adopt the new technologies emerging from music information retrieval, without the demand to change completely proven to be good methodologies in musicology.The aim is to suggest that new technologies can help existing research and open new research domains in terms of the quantity of music and new quantitative methods of evaluation. 
Birmingham, which will investigate the need for medieval states to sustain armies by organising and distributing resources. A grid-based framework shall virtually reenact the Battle of Manzikert in 1071, a key historic event in Byzantine history. Agent-based modelling technologies will attempt to find out more about the reasons why the Byzantine army was so heavily defeated by the Seljurk Turks. Grid environments offer the chance to solve such complex human problems through distributed simultaneous computing. In all the new projects, we can identify a clear trend towards investigating new methodologies for arts and humanities research, possible only because grid technologies offer unknown data and computational resources.We could see how e-Science in the arts and humanities has matured towards the development of concrete tools that systematically investigate the use of e-Science for research. Whether it is simulation of past battles or musicology using state-of-the-art information retrieval techniques, this research would have not been possible before the shift in methodology towards e-Science and e-Research. References Blanke, T. and S. Dunn (2006). The Arts and Humanities eScience Initiative in the UK. E-Science ‘06: Proceedings of the Second IEEE International Conference on e-Science and Grid Computing, Amsterdam, IEEE Computer Society. Hey, T. and A. Trefethen (2003). The data deluge: an e-Science perspective. In F. Berman, A. Hey and G. Fox (eds) Grid Computing: Making the Global Infrastructure a Reality. Hoboken, NJ, John Wiley & Sons. Nentwich, M. (2003). Cyberscience. Research in the Age of the Internet.Vienna, Austrian Academy of Science Press. Unsworth, J. (2006). “The Draft Report of the American Council of Learned Societies’ Commission on Cyberinfrastructure for Humanities and Social Sciences.” from http://www.acls.org/cyberinfrastructure/. Building on the earlier investigations into the data deluge and how to deal with it, many of the second-phase projects look into the so-called ‘knowledge technologies’ that help with data and text mining as well as simulations in decision support for arts and humanities research. One example is the ‘Medieval Warfare on the Grid: The Case of Manzikert’ project in _____________________________________________________________________________ 61 Digital Humanities 2008 _____________________________________________________________________________ An OWL-based index of emblem metaphors Peter Boot peter.boot@huygensinstituut.knaw.nl Huygens Institute Intro This paper describes an index on metaphor in Otto Vaenius’ emblem book Amoris divini emblemata (Antwerp, 1615). The index should be interesting both for its contents (that is, for the information about the use of metaphor in the book) and as an example of modelling a complex literary phenomenon. Modelling a complex phenomenon creates the possibility to formulate complex queries on the descriptions that are based on the model.The article describes an application that uses this possibility. The application user can interrogate the metaphor data in multiple ways, ranging from canned queries to complex selections built in the application’s guided query interface. Unlike other emblem indices, the metaphor index is not meant to be a tool for resource discovery, a tool that helps emblem scholars find emblems relevant to their own research.It presents research output rather than input. 
The modelling techniques that it exemplifies should help a researcher formulate detailed observations or findings about his research subject – in this case, metaphor – and make these findings amenable to further processing. The result is an index, embedded in an overview or explanation of the data for the reader. I will argue that, for research output data, it is up to the researcher who uses these modelling techniques to integrate the presentation of the data into a narrative or argument, and I describe one possible way of effecting this integration. The paper builds on the techniques developed in (Boot 2006). The emblem book is encoded using TEI; a model of metaphor (an ontology) is formulated in OWL; the observations about the occurrence of metaphors are stored as RDF statements. An essay about the more important metaphors in this book is encoded in TEI. This creates a complex and interlinked structure that may be explored in a number of ways. The essay is hyperlinked to (1) individual emblems, (2) the presentation of individual metaphors in emblems, (3) searches in the metaphor data, and (4) concepts in the metaphor ontology. From each of these locations, further exploration is possible. Besides these ready-made queries, the application also facilitates user-defined queries on the metaphor data. The queries are formulated using the SPARQL RDF query language, but the application’s guided query interface hides the actual syntax from the user.
Metaphor model
There are a number of aspects of metaphor, and of the texts in which metaphors occur, that are modelled in the metaphor index. A metaphor has a vehicle and a tenor, in the terminology of Richards (1936). When love, for its strength and endurance in adversity, is compared to a tree, the tree is the vehicle, love is the tenor. It is possible to define hierarchies, both for the comparands (that is, vehicles and tenors) and for the metaphors: we can state that ‘love as a tree’ (love being firmly rooted) belongs to a wider class of ‘love as a plant’ (love bearing fruit) metaphors. We can also state that a tree is a plant, and that it (with roots, fruit, leaves and seeds) belongs to the vegetal kingdom (Lakoff and Johnson 1980). It often happens that an emblem contains references to an object invested with metaphorical meaning elsewhere in the book. The index can record these references without necessarily indicating what they are supposed to stand for. The index can also represent the locations in the emblem (the text and image fragments) that refer to the vehicles and tenors. The text fragments are stretches of emblem text, the image fragments are rectangular regions in the emblem pictures. The index uses the TEI-encoded text structure in order to relate occurrences of the comparands to locations in the text.
The metaphor model is formulated using the Web Ontology Language OWL (McGuinness and Van Harmelen 2004). An ontology models the kinds of objects that exist in a domain, their relationships and their properties; it provides a shared understanding of a domain. On a technical level, the ontology defines the vocabulary to be used in the RDF statements in our model. The ontology thus limits the things one can say; it provides, in McCarty’s words (McCarty 2005), the ‘explicit, delimited conception of the world’ that makes meaningful manipulation possible. The ontology is also what ‘drives’ the application built for consultation of the metaphor index. See for similar uses of OWL: (Ciula and Vieira 2007), (Zöllner-Weber 2005). The paper describes the classes that the OWL model contains and the relationships between them. Some of these relationships are hierarchical (‘trees belong to the vegetal kingdom’), others represent relations between objects (‘emblem 6 uses the metaphor of life as a journey’ or ‘metaphor 123 is a metaphor for justice’). The relationships are what makes it possible to query objects by their relations to other objects: to ask for all the metaphors based in an emblem picture, to ask for all of the metaphors for love, or to combine these criteria.
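To make this concrete, the following minimal sketch shows how one such observation might be recorded and queried with RDF and SPARQL, here using Python’s rdflib. The namespace, class and property names are invented for illustration and are not the index’s actual vocabulary.

```python
from rdflib import Graph, Namespace, RDF, RDFS

# Hypothetical vocabulary, for illustration only; not the index's actual ontology.
EM = Namespace("http://example.org/emblem-metaphor#")

g = Graph()
g.bind("em", EM)

# 'Love as a tree' in emblem 6: its tenor, its vehicle, and the picture region it occurs in.
g.add((EM.metaphor123, RDF.type, EM.Metaphor))
g.add((EM.metaphor123, EM.hasTenor, EM.love))
g.add((EM.metaphor123, EM.hasVehicle, EM.tree))
g.add((EM.metaphor123, EM.occursIn, EM.emblem6_picture))
g.add((EM.emblem6_picture, RDF.type, EM.PictureFragment))

# One way to record comparand hierarchies ('a tree is a plant'),
# here treating comparands as classes.
g.add((EM.tree, RDFS.subClassOf, EM.plant))

# All metaphors for love that are based in an emblem picture.
query = """
SELECT ?m WHERE {
    ?m a em:Metaphor ;
       em:hasTenor em:love ;
       em:occursIn ?fragment .
    ?fragment a em:PictureFragment .
}
"""
for row in g.query(query, initNs={"em": EM}):
    print(row.m)
```

A guided query interface of the kind described above would assemble queries of this shape behind the scenes, so that the user never sees the SPARQL syntax itself.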
Application
In order to present the metaphor index to a reader, a web application has been developed that allows readers to consult and explore the index. The application is an example of an ontology-driven application as discussed in (Guarino 1998): the data model, the application logic and the user interface are all based on the metaphor ontology. The application was created using PHP and a MySQL database backend. RAP, the RDF API for PHP, is used for handling RDF. RDF and OWL files that contain the ontology and the occurrences are stored in an RDF model in the database. RDF triples that represent the structure of the emblem book are created from the TEI XML file that contains the digital text of the emblem book. The application has to provide insight into three basic layers of information: our primary text (the emblems), the database-like collection of metaphor data, and a secondary text that should tie these layers together into a coherent whole. The application organizes this in three perspectives: an overview perspective, an emblem perspective and an ontology perspective. Each of these perspectives offers one or more views on the data. These views are (1) a basic selection interface into the metaphor index; (2) an essay about the use and meaning of metaphor in this book; (3) a single emblem display; (4) information about metaphor use in the emblem; and (5) a display of the ontology defined for the metaphor index (built using OWLDoc). The paper will discuss the ways in which the user can explore the metaphor data.
Figure 1 Part of the classes that make up the metaphor ontology. Arrows point to subclasses. The classes at the bottom level are just examples; many others could have been shown if more space were available. For simplicity, this diagram ignores class properties
Discussion
The metaphor index is experimental, among other things in its modelling of metaphor and in its use of OWL and RDF in a humanities context. If Willard McCarty is right, in some respects all humanities computing is experimental. There is, however, a certain tension between the experimental nature of this index and the need to collect a body of material and create a display application. If the aim is not to support resource discovery, but solely to provide insight, do we then need this large amount of data? Is all software meant to be discarded, as McCarty quotes Perlis? The need to introduce another aspect of metaphor into the model may conflict with the need to create a body of material that is worthwhile to explore. It is also true, however, that insight doesn’t come from subtlety alone. There is no insight without numbers.
McCarty writes about the computer as ‘a rigorously disciplined means of implementing trial-and-error (…) to help the scholar refine an inevitable mismatch between a representation and reality (as he or she conceives it) to the point at which the epistemological yield of the representation has been realized’. It is true that the computer helps us be rigorous and disciplined, but perhaps for that very reason the representations that the computer helps us build may become a burden. Computing can slow us down. To clarify the conceptual structure of metaphor as it is used in the book, we do not necessarily need a work of reference. The paper’s concluding paragraphs will address this tension.
Figure 2 A metaphor and the properties relating it to the comparands
Figure 3 Objects can be queried by their relations
Figure 4 Overview perspective
Figure 5 Clicking the hyperlink ‘plant life’ (top right) executes a query with hits shown in the left panel
Figure 6 Emblem perspective with one metaphor highlighted in picture and text
Figure 7 Ontology perspective, with display of the class metaphor
Figure 8 Expert search, start building query
Figure 9 Expert search. Click ‘+’ to create more criteria
Figure 10 Expert search. Select the desired criterion
Figure 11 Expert search. Final state of query
Figure 12 Expert search. Display of executed query and generated RDQL in results panel
References
Antoniou, Grigoris and Van Harmelen, Frank (2004), A Semantic Web Primer (Cooperative Information Systems; Cambridge (Ma); London: MIT Press).
Boot, Peter (2006), ‘Decoding emblem semantics’, Literary and Linguistic Computing, 21 supplement 1, 15-27.
Ciula, Arianna and Vieira, José Miguel (2007), ‘Implementing an RDF/OWL Ontology on Henry the III Fine Rolls’, paper given at OWLED 2007, Innsbruck.
Guarino, Nicola (1998), ‘Formal Ontology and Information Systems’, in Nicola Guarino (ed.), Formal Ontology in Information Systems. Proceedings of FOIS’98, Trento, Italy, 6-8 June 1998 (Amsterdam: IOS Press), 3-15.
Lakoff, George and Johnson, Mark (1980), Metaphors we live by (Chicago; London: University of Chicago Press).
McCarty, Willard (2005), Humanities Computing (Basingstoke: Palgrave Macmillan).
McGuinness, Deborah L. and Van Harmelen, Frank (2004), ‘OWL Web Ontology Language. Overview. W3C Recommendation 10 February 2004’, <http://www.w3.org/TR/owl-features/>, accessed 2007-02-24.
Richards, Ivor Armstrong (1936), The Philosophy of Rhetoric (New York, London: Oxford University Press).
Zöllner-Weber, Amelie (2005), ‘Formale Repräsentation und Beschreibung von literarischen Figuren’, Jahrbuch für Computerphilologie – online, 7.

Collaborative tool-building with Pliny: a progress report
John Bradley
john.bradley@kcl.ac.uk
King’s College London, UK
In the early days, the Digital Humanities (DH) focused on the development of tools to support the individual scholar in performing original scholarship, and tools such as OCP and TACT emerged that were aimed at the individual scholar. Very little tool-building within the DH community is now aimed generally at individual scholarship.
There are, I think, two reasons for this: • First, the advent of the Graphical User Interface (GUI) made tool building (in terms of software applications that ran on the scholar’s own machine) very expensive. Indeed, until recently, the technical demands it put upon developers have been beyond the resources of most tool developers within the Humanities. • Second, the advent of the WWW has shifted the focus of much of the DH community to the web. However, as a result, tool building has mostly focused on not the doing of scholarly research but on the publishing of resources that represent the result of this. DH’s tools to support the publishing of, say, primary sources, are of course highly useful to the researcher when his/her primary research interest is the preparation of a digital edition. They are not directly useful to the researcher using digital resources. The problem (discussed in detail in Bradley 2005) is that a significant amount of the potential of digital materials to support individual research is lost in the representation in the browser, even when based on AJAX or Web 2.0 practices. The Pliny project (Pliny 2006-7) attempts to draw our attention as tool builders back to the user of digital resources rather than their creator, and is built on the assumption that the software application, and not the browser, is perhaps the best platform to give the user full benefit of a digital resource. Pliny is not the only project that recognises this. The remarkable project Zotero (Zotero 2006-7) has developed an entire plugin to provide a substantial new set of functions that the user can do within their browser. Other tool builders have also recognised that the browser restricts the kind of interaction with their data too severely and have developed software applications that are not based on the web browser (e.g. Xaira (2005-7), WordHoard (2006-7), Juxta (2007), VLMA (2005-7)). Some of these also interact with the Internet, but they do it in ways outside of conventional browser capabilities. Further to the issue of tool building is the wish within the DH community to create tools that work well together. This problem has often been described as one of modularity – building separate components that, when put together, _____________________________________________________________________________ 65 Digital Humanities 2008 _____________________________________________________________________________ allow the user to combine them to accomplish a range of things perhaps beyond the capability of each tool separately. Furthermore, our community has a particularly powerful example of modularity in Wilhelm Ott’s splendid TuStep software (Ott 2000). TuStep is a toolkit containing a number of separate small programs that each perform a rather abstract function, but can be assembled in many ways to perform a very large number of very different text processing tasks. However, although TuStep is a substantial example of software designed as a toolkit, the main discussion of modularity in the DH (going back as far as the CETH meetings in the 1990s) has been in terms of collaboration – finding ways to support the development of tools by different developers that, in fact, can co-operate. This is a very different issue from the one TuStep models for us. There is as much or more design work employed to create TuStep’s framework in which the separate abstract components operate (the overall system) as there is in the design of each component itself. 
This approach simply does not apply when different groups are designing tools semi-independently. What is really wanted is a world where software tools such as WordHoard can be designed in ways that allow other tools (such as Juxta) to interact in a GUI, on the screen. Why is this so difficult? Part of the problem is that traditional software development focuses on a “stack” approach. Layers of ever more specific software are built on top of more general layers to create a specific application, and each layer in the stack is aimed more precisely at the ultimate functions the application was meant to provide. In the end each application runs in a separate window on the user’s screen and is focused specifically and exclusively on the functions the software was meant to perform. Although software could be written to support interaction between different applications, in practice this is still rarely considered, and is difficult to achieve. Instead of the “stack” model of software design, Pliny is constructed on top of the Eclipse framework (Eclipse 2005-7), and uses its contribution model based on Eclipse’s plugin approach (see a description of it in Birsan 2005). This approach promotes effective collaborative, yet independent, tool building and makes possible many different kinds of interaction between separately written applications. Annotation provides an excellent motivation for this. A user may wish to annotate something in, say, WordHoard. Later, this annotation will need to be shown with annotations attached to other objects from other pieces of software. If the traditional “stack” approach to software is applied, each application would build its own annotation component inside its software, and the user would not be able to bring notes from different tools together. Instead of writing separate little annotation components inside each application, Eclipse allows objects from one application to participate as “first-class” objects in the operation of another. Annotations belong simultaneously to the application in which they were created and to Pliny’s annotation and note-taking management system. Pliny’s plugins support the manipulation of annotations while simultaneously allowing other (properly constructed) applications to create and display annotations that Pliny manages for them. Furthermore, Pliny is able to recognise and represent references to materials in other applications within its own displays. See Figures I and II for examples of this, in conjunction with the prototype VLMA (2005-7) plugin I created from the standalone application produced by the VLMA development team. In Figure I most of the screen is managed by the VLMA application, but Pliny annotations have been introduced and combined with VLMA materials. Similarly, in Figure II, most of the screen is managed by Pliny and its various annotation tools, but I have labelled areas on the screen where aspects of the VLMA application still show through. Pliny, then, is about two issues:
• First, Pliny focuses on digital annotation and note-taking in humanities scholarship, and shows how they can be used to facilitate the development of an interpretation. This has been described in previous papers and is not presented here.
• Second, Pliny models how one could build GUI-oriented software applications that, although developed separately, support a richer set of interactions and integration on the screen.
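Pliny itself is written in Java on the Eclipse platform, so the following Python fragment is not Eclipse’s API; it is only a schematic sketch, under invented names, of the contribution pattern just described, in which independently written tools hand their objects to a shared annotation manager rather than each keeping a private annotation component.

```python
class AnnotationManager:
    """Shared service: owns all annotations, whatever tool the target object comes from."""
    def __init__(self):
        self._notes = {}                      # target id -> list of note texts

    def annotate(self, target_id, note):
        self._notes.setdefault(target_id, []).append(note)

    def annotations_for(self, target_id):
        return self._notes.get(target_id, [])


# Two independently written "tools" (hypothetical stand-ins for, say, a lightbox
# viewer and a text-analysis tool) contribute their own kinds of objects.
class LightboxImage:
    def __init__(self, image_id):
        self.id = "lightbox:" + image_id

class ConcordanceLine:
    def __init__(self, word, position):
        self.id = f"concordance:{word}:{position}"


manager = AnnotationManager()
img = LightboxImage("vase-042")
line = ConcordanceLine("love", 1234)

# Each tool attaches notes through the shared manager, not through a private component,
# so the user's notes from different tools can later be brought together.
manager.annotate(img.id, "Compare the iconography with emblem 6.")
manager.annotate(line.id, "Metaphorical use; see the note on the vase image.")

print(manager.annotations_for(img.id))
```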
This presentation focuses primarily on this second theme, and is a continuation of the issues raised at last year’s poster session on this subject at the DH2007 conference (Bradley 2007). It arises from a consideration of Pliny’s first issue, since note-taking is by its very nature an integrative activity – bringing together materials created in the context of a large range of resources and kinds of resources.
Figure I: Pliny annotations in a VLMA viewer
Figure II: VLMA objects in a Pliny context
This connecting of annotation to a digital object rather than merely to its display presents some new issues. What, for example, does it mean to link an annotation to a line of a KWIC display – should that annotation appear when the same KWIC display line appears in a different context generated as the result of a different query? Should it appear attached to the particular word token when the document that contains it is displayed? If an annotation is attached to a headword, should it be displayed automatically in a different context when its word occurrences are displayed, or only in the context in which the headword itself is displayed? These are the kinds of questions about annotation and context that can only really be explored in an integrated environment such as the one described here, and some of the discussion in this presentation will come from prototypes built to work with the RDF data application VLMA, with the beginnings of a TACT-like text analysis tool, and with a tool based on Google Maps that allows one to annotate a map.
Building our tools in contexts such as Pliny’s, which allow for a complex interaction between components, results in a much richer, and more appropriate, experience for our digital user. For the first time, perhaps, s/he will be able to experience the kind of interaction between materials that is made available through applications that expose, rather than hide, the true potential of digital objects. Pliny provides a framework in which objects from different tools are brought into close proximity and connected by the paradigm of annotation. Perhaps there are also paradigms other than annotation that are equally interesting for object linking?
References
Birsan, Dorian (2005). “On Plug-ins and Extensible Architectures”, In Queue (ACM), Vol 3 No 2.
Bradley, John (2005). “What you (fore)see is what you get: Thinking about usage paradigms for computer assisted text analysis” in Text Technology Vol. 14 No 2, pp 1-19. Online at http://texttechnology.mcmaster.ca/pdf/vol14_2/bradley142.pdf (Accessed November 2007).
Bradley, John (2007). “Making a contribution: modularity, integration and collaboration between tools in Pliny”. In book of abstracts for the DH2007 conference, Urbana-Champaign, Illinois, June 2007. Online copy available at http://www.digitalhumanities.org/dh2007/abstracts/xhtml.xq?id=143 (Accessed October 2007).
Eclipse 2005-7. Eclipse Project Website. At http://www.eclipse.org/ (Accessed October 2007).
Juxta (2007). Project web page at http://www.nines.org/tools/juxta.html (Accessed November 2007).
Ott, Wilhelm (2000). “Strategies and Tools for Textual Scholarship: the Tübingen System of Text Processing Programs (TUSTEP)” in Literary & Linguistic Computing, 15:1, pp. 93-108.
Pliny 2006-7. Pliny Project Website. At http://pliny.cch.kcl.ac.uk (Accessed October 2007).
VLMA (2005-7). VLMA: A Virtual Lightbox for Museums and Archives. Website at http://lkws1.rdg.ac.uk/vlma/ (Accessed October 2007).
WordHoard (2006-7). WordHoard: An application for close reading and scholarly analysis of deeply tagged texts. Website at http://wordhoard.northwestern.edu/userman/index.html (Accessed November 2007).
Xaira (2005-7). Xaira Page. Website at http://www.oucs.ox.ac.uk/rts/xaira/ (Accessed October 2007).
Zotero (2006-7). Zotero: A personal research assistant. Website at http://www.zotero.org/ (Accessed September 2007).

How to find Mrs. Billington? Approximate string matching applied to misspelled names
Gerhard Brey
gerhard.brey@kcl.ac.uk
King’s College London, UK
Manolis Christodoulakis
m.christodoulakis@uel.ac.uk
University of East London, UK
In this paper we discuss some of the problems that arise when searching for misspelled names. We suggest a solution to overcome these and to disambiguate the names found.
Introduction
The Nineteenth-Century Serials Edition (NCSE) is a digital edition of six nineteenth-century newspaper and periodical titles. It is a collaborative project between Birkbeck College, King’s College London, the British Library and Olive Software Inc., funded by the UK’s Arts and Humanities Research Council. The corpus consists of about 100,000 pages that were microfilmed, scanned and processed using optical character recognition (OCR) software to obtain images for display in the web application as well as the full text of the publications. In the course of this processing (by Olive Software) the text of each individual issue was also automatically segmented into its constituent parts (newspaper departments, articles, advertisements, etc.). The application of text mining techniques (named entity extraction, text categorisation, etc.) allowed names, institutions and places to be extracted and individual articles to be classified according to events, subjects and genres. Users of the digital edition can search for very specific information. The extent to which these searches are successful depends, of course, on the quality of the available data.
The problem
The quality of some of the text resulting from the OCR process varies from barely readable to illegible. This reflects the poor print quality of the original paper copies of the publications. A simple search of the scanned and processed text for a person’s name, for example, would retrieve exact matches but ignore incorrectly spelled or distorted variations. Misspellings of ordinary text could be checked against an authoritative electronic dictionary; an equivalent reference work for names does not exist. This paper describes the solutions that are being investigated to overcome these difficulties.
Figure: This theatrical notice on page 938 of the Leader from 17.11.1860 highlights the limitations of OCR alone.
The actress Mrs. Billington is mentioned twice. OCR recognised the name once as Mrs. lUllinijton and then as Mrs. BIIMngton. A simple search for the name Billington would therefore be unsuccessful. By applying a combination of approximate string matching techniques and allowing for a certain number of spelling errors (see below), our more refined approach successfully finds the two distorted spellings as Mrs. Billington. However, it also finds a number of other unrelated names (Wellington and Rivington among others).
This additional problem is redressed by mapping misspelled names to correctly spelled names. Using a combination of string similarity and string distance algorithms, we have developed an application to rank misspelled names according to their likelihood of representing a correctly spelled name.
The algorithm
As already briefly mentioned above, we are applying several well-known pattern matching algorithms to perform approximate pattern matching, where the pattern is a given name (normally a surname) and the “text” is a (huge) list of names obtained from the OCR of scanned documents. The novelty of this work comes from the fact that we are utilizing application-specific characteristics to achieve better results than are possible through general-purpose pattern matching techniques. Currently we consider the pattern to be error-free (although our algorithms can easily be extended to deal with errors in the pattern too). Moreover, all the algorithms take as input the maximum “distance” that a name in the list may have from the pattern to be considered a match; this distance is given as a percentage. As one would expect, there is a trade-off between distance and quality of matches: a low distance threshold yields fewer false positives, but may miss some true matches; on the other hand, a high distance threshold is less likely to miss true matches, but will return many fake ones.
At the heart of the algorithms lies a ranking system, the purpose of which is to sort the matching names according to how well they match. (Recall that the list of matching names can be huge, and thus what is more important is really the ranking of those matches, to distinguish the good ones from random matches.) Obviously, the ranking system depends on the distance metric in use, but there is more to be taken into account. Notice that, if a name matches our pattern with some error, e, there are two cases:
- either the name is a true match, and the error e is due to bad OCR, or
- the name (misspelled or not, by OCR) does not correspond to our pattern, in which case it is a bad match and should receive a lower rank.
It is therefore essential, given a possibly misspelled (due to OCR) name, s’, to identify the true name, s, that it corresponds to. Then, it is obvious that s’ is a good match if p = s, and a bad match otherwise, where p is the pattern. To identify whether a name s’ in our list is itself a true name, or has been misspelled, we use two types of evaluation:
- We count the occurrences of s’ in the list. A name that occurs many times is likely to be a true name; if it had been misspelled, it is very unlikely that all its occurrences would have been misspelled in exactly the same manner by the OCR software.
- We have compiled a list L of valid names; these names are then used to decide whether s’ is a valid name (s’ ∈ L) or not (s’ ∉ L).
In the case where s’ is indeed misspelled by OCR, and is thus not a true name, one must use distance metrics to identify how closely it matches the pattern p. Given that the pattern is considered to be error-free, if the distance of the name from the pattern is large then it is very unlikely (but not impossible) that so many of the symbols of the name have been misspelled by OCR; most probably the name does not really match the pattern. Taking into account the nature of the errors that occur in our application, when computing the distance of a name in the list from our pattern, we consider optical similarities. That is, we drop the common tactic where one symbol is compared against another symbol and either they match – so the distance is 0 – or they don’t – so the distance is 1; instead, we consider a symbol (or a group of symbols) to have a low distance from another symbol (or group of symbols) if their shapes look similar. As an example, “m” is optically very similar to “rn”, and thus should be assigned a small distance, say 1, while “m” and “b” do not look similar to each other and therefore should have a big distance, say 10.
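As an illustration only (the cost values, symbol groups and group sizes below are invented, not the project’s actual settings), a standard edit-distance computation can be generalised so that substitutions between optically similar symbols or symbol groups are cheap:

```python
# Illustrative optical-similarity costs: symbol (group) pairs that look alike are
# cheap to confuse; everything else costs an ordinary substitution.
OPTICAL_COST = {
    ("m", "rn"): 1, ("rn", "m"): 1,
    ("l", "I"): 1, ("I", "l"): 1,
    ("i", "I"): 1, ("I", "i"): 1,
    ("i", "j"): 2, ("j", "i"): 2,
}
DEFAULT_SUB, INDEL = 10, 7          # ordinary substitution / insertion-deletion costs

def optical_distance(pattern, name, max_group=2):
    """Edit distance in which optically similar symbol groups are cheap to exchange."""
    m, n = len(pattern), len(name)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i * INDEL
    for j in range(1, n + 1):
        d[0][j] = j * INDEL
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pattern[i - 1] == name[j - 1] else DEFAULT_SUB
            best = min(d[i - 1][j] + INDEL,        # delete from the pattern
                       d[i][j - 1] + INDEL,        # insert into the pattern
                       d[i - 1][j - 1] + cost)     # match or plain substitution
            # Group operations, e.g. 'm' read by OCR as 'rn'.
            for gi in range(1, min(max_group, i) + 1):
                for gj in range(1, min(max_group, j) + 1):
                    pair = (pattern[i - gi:i], name[j - gj:j])
                    if pair in OPTICAL_COST:
                        best = min(best, d[i - gi][j - gj] + OPTICAL_COST[pair])
            d[i][j] = best
    return d[m][n]

for candidate in ("BIIMngton", "lUllinijton", "Wellington"):
    print(candidate, optical_distance("Billington", candidate))
```

A candidate list can then be ranked by such a distance, combined with the occurrence counts and the list of valid names described above.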
The results of our efforts to date have been very promising. We look forward to investigating opportunities to improve the effectiveness of the algorithm in the future.
Bibliography
NCSE website: http://www.ncse.kcl.ac.uk/index.html
Cavnar, William B. and John M. Trenkle. “N-Gram-Based Text Categorization”, in: Proceedings of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval, Las Vegas 1994, 161-175; http://citeseer.ist.psu.edu/68861.html; accessed Nov 2007.
Cavnar, William B. “Using An N-Gram-Based Document Representation With A Vector Processing Retrieval Model”, in: TREC, 1994, 269-277; http://trec.nist.gov/pubs/trec3/papers/cavnar_ngram_94.ps.gz; accessed Nov 2007.
Gusfield, Dan. Algorithms on strings, trees, and sequences: computer science and computational biology, Cambridge (CUP) 1997.
Hall, Patrick A.V. and Geoff R. Dowling. “Approximate String Matching”, in: ACM Computing Surveys, 12.4 (1980), 381-402; http://doi.acm.org/10.1145/356827.356830; accessed November 2007.
Jurafsky, Daniel Saul and James H. Martin. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, Upper Saddle River, NJ (Prentice Hall) 2000, Chapter 5.
Lapedriza, Àgata and Jordi Vitrià. “Open N-Grams and Discriminant Features in Text World: An Empirical Study”; http://www.cvc.uab.es/~jordi/AgataCCIA2004.pdf; accessed November 2007.
Navarro, Gonzalo. “A guided tour to approximate string matching”, in: ACM Computing Surveys, 33.1 (2001), 31-88; http://doi.acm.org/10.1145/375360.375365; accessed November 2007.
Navarro, Gonzalo and Mathieu Raffinot. Flexible pattern matching in strings: practical on-line search algorithms for texts and biological sequences, Cambridge (CUP) 2002.
Oakes, Michael P. Statistics for corpus linguistics, Edinburgh (Edinburgh Univ. Press) 1998 (Edinburgh textbooks in empirical linguistics, XVI), Chapter 3.4.
_____________________________________________________________________________ 69 Digital Humanities 2008 _____________________________________________________________________________ Degrees of Connection: the close interlinkages of Orlando Susan Brown sbrown@uoguelph.ca University of Guelph, Canada Patricia Clements patricia.clements@ualberta.ca University of Alberta, Canada Isobel Grundy Isobel.Grundy@UAlberta.ca University of Alberta, Canada Stan Ruecker sruecker@ualberta.ca University of Alberta, Canada Jeffery Antoniuk jefferya@ualberta.ca University of Alberta, Canada Sharon Balazs Sharon.Balazs@ualberta.ca University of Alberta, Canada Orlando: Women’s Writing in the British Isles from the Beginnings to the Present is a literary-historical textbase comprising more than 1,200 core entries on the lives and writing careers of British women writers, male writers, and international women writers; 13,000+ free-standing chronology entries providing context; 12,000+ bibliographical listings; and more than 2 million tags embedded in 6.5 million words of born-digital text. The XML tagset allows users to interrogate everything from writers’ relationships with publishers to involvement in political activities or their narrative techniques. The current interface allows users to access entries by name or via various criteria associated with authors; to create chronologies by searching on tags and/or contents of dated materials; and to search the textbase for tags, attributes, and text, or a combination. The XML serves primarily to structure the materials; to allow users to draw on particular tags to bring together results sets (of one or more paragraphs incorporating the actual ‘hit’) according to particular interests; and to provide automatic hyperlinking of names, places, organizations, and titles. Recognizing both that many in our target user community of literary students and scholars dislike tag searches, and that our current interface has not fully plumbed the potential of Orlando’s experimentation in structured text, we are exploring what other kinds of enquiry and interfaces the textbase can support.We report here on some investigations into new ways of probing and representing the links created by the markup. The current interface leverages the markup to provide contexts for hyperlinks. Each author entry includes a “Links” screen that provides hyperlinks to mentions of that author elsewhere in the textbase. These links are sorted into groups based on the semantic tags of which they are children, so that users can choose, for instance, from the more than 300 links on the George Eliot Links screen, between a link to Eliot in the Elizabeth Stuart Phelps entry that occurs in the context of Family or Intimate Relationships, and a link to Eliot in Simone de Beauvoir’s entry that occurs in the discussion of features of de Beauvoir’s writing. Contextualizing Links screens are provided not only for writers who have entries, but also for any person who is mentioned more than once in the textbase, and also for titles of texts, organizations, and places. It thus provides a unique means for users to pursue links in a less directed and more informed way than that provided by many interfaces. Building on this work, we have been investigating how Orlando might support queries into relationships and networking, and present not just a single relationship but the results of investigating an entire field of interwoven relationships of the kind that strongly interest literary historians. 
Rather than beginning from a known set of networks or interconnections, how might we exploit our markup to analyze interconnections, reveal new links, or determine the points of densest interrelationship? Interface design in particular, if we start to think about visualizing relationships rather than delivering them entirely in textual form, poses considerable challenges. We started with the question of the degrees of separation between different people mentioned in disparate contexts within the textbase. Our hyperlinking tags allow us to conceptualize links between people not only in terms of direct contact, that is person-to-person linkages, but also in terms of linkages through other people, places, organizations, or texts that they have in common. Drawing on graph theory, we use the hyperlinking tags as key indices of linkages. Two hyperlinks coinciding within a single document—an entry or a chronology event--were treated as vertices that form an edge, and an algorithm was used to find the shortest path between source and destination. So, for instance, you can get from Chaucer to Virginia Woolf in a single step in twelve different ways: eleven other writer entries (including Woolf’s own) bring their names together, as does the following event: 1 November 1907: The British Museum’s reading room reopened after being cleaned and redecorated; the dome was embellished with the names of canonical male writers, beginning with Chaucer and ending with Browning. Virginia Woolf’s A Room of One’s Own describes the experience of standing under the dome “as if one were a thought in the huge bald forehead which is so splendidly encircled by a band of famous names.” Julia Hege in Jacob’s Room complains that they did not leave room for an Eliot or a Brontë. It takes more steps to get from some writers to others: five, for instance, to get from Frances Harper to Ella Baker. But this is very much the exception rather than the rule. _____________________________________________________________________________ 70 Digital Humanities 2008 _____________________________________________________________________________ Calculated according to the method described here, we have a vast number of links: the textbase contains 74,204 vertices with an average of 102 edges each (some, such as London at 101,936, have considerably more than others), meaning that there are 7.5 million links in a corpus of 6.5 million words. Working just with authors who have entries, we calculated the number of steps between them all, excluding some of the commonest links: the Bible, Shakespeare, England, and London. Nevertheless, the vast majority of authors (on average 88%) were linked by a single step (such as the example of Chaucer and Woolf, in which the link occurs within the same source document) or two steps (in which there is one intermediate document between the source and destination names). Moreover, there is a striking similarity in the distribution of the number of steps required to get from one person to another, regardless of whether one moves via personal names, places, organizations, or titles. 10.6% of entries are directly linked, that is the two authors are mentioned in the same source entry or event. Depending on the hyperlinking tag used, one can get to the destination author with just one intermediate step, or two degrees of separation, in 72.2% to 79.6% of cases. 
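The underlying computation is a conventional breadth-first search over a co-occurrence graph. The sketch below is schematic only: the document identifiers and entity names are invented placeholders, not Orlando’s actual markup or code.

```python
from collections import defaultdict, deque

# Hypothetical input: document id -> set of entities hyperlinked within it.
doc_links = {
    "entry:woolf":   {"person:woolf", "person:chaucer", "place:london"},
    "event:1907-11": {"person:chaucer", "person:woolf", "org:british-museum"},
    "entry:eliot":   {"person:eliot", "person:gaskell", "place:london"},
}

# Two entities co-occurring in a single document form an edge (a vertex pair).
graph = defaultdict(set)
for entities in doc_links.values():
    for a in entities:
        for b in entities:
            if a != b:
                graph[a].add(b)

def degrees_of_separation(source, target):
    """Breadth-first search for the shortest chain of co-occurrences."""
    queue, seen = deque([(source, 0)]), {source}
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None   # no connection: an "island"

print(degrees_of_separation("person:chaucer", "person:woolf"))   # 1: co-occur in one document
print(degrees_of_separation("person:chaucer", "person:eliot"))   # 2: linked via London
```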
Instances of greater numbers of steps decline sharply, so that there are 5 degrees of separation in only 0.6% of name linkage pages, and none at all for places. Six degrees of separation does not exist in Orlando between authors with entries, although there are a few “islands”, in the order of from 1.6% to 3.2%, depending on the link involved, of authors who do not link to others. These results raise a number of questions. As Albert-Lászlo Barabási reported of social networks generally, one isn’t dealing with degrees of separation so much as degrees of proximity. However, in this case, dealing not with actual social relations but the partial representation in Orlando of a network of social relations from the past, what do particular patterns such as these mean? What do the outliers—people such as Ella Baker or Frances Harper who are not densely interlinked with others—and islands signify? They are at least partly related to the brevity of some entries, which can result either from paucity of information, or decisions about depth of treatment, or both. But might they sometimes also represent distance from literary and social establishments? Furthermore, some linkages are more meaningful, in a literary historical sense, than others. For instance, the Oxford Dictionary of National Biography is a common link because it is frequently cited by title, not because it indicates a historical link between people. Such incidental links can’t be weeded out automatically. So we are investigating the possibility of using the relative number of single- or double-step links between two authors to determine how linked they ‘really’ are. For instance, Elizabeth Gaskell is connected to William Makepeace Thackeray, Charles Dickens, and George Eliot by 25, 35, and 53 single-step links, respectively, but to Margaret Atwood, Gertrude Stein, and Toni Morrison by 2, 1, and 1. Such contrasts suggest the likely utility of such an approach to distinguishing meaningful from incidental associations. The biggest question invited by these inquiries into linkages is: how might new modes of inquiry into, or representation of, literary history, emerge from such investigations? One way to address this question is through interfaces. We have developed a prototype web application for querying degrees of separation in Orlando, for which we are developing an interface. Relationships or associations are conventionally represented by a network diagram, where the entities are shown as nodes and the relationships as lines connecting the nodes. Depending on the content, these kinds of figures are also referred to as directed graphs, link-node diagrams, entity-relationship (ER) diagrams, and topic maps. Such diagrams scale poorly, since the proliferation of items results in a tangle of intersecting lines. Many layout algorithms position the nodes to reduce the number of crisscrossing lines, resulting in images misleading to people who assume that location is meaningful. In the case of Orlando, two additional complexities must be addressed. First, many inter-linkages are dense: there are often 50 distinct routes between two people. A conventional ER diagram of this data would be too complex to be useful as an interactive tool, unless we can allow the user to simplify the diagram. Second, the Orlando data differs from the kind of data that would support “distant reading” (Moretti 1), so our readers will need to access the text that the diagram references. How, then, connect the diagram to a reading view? 
We will present our preliminary responses to these challenges in an interface for degree of separation queries and results.We are also experimenting with the Mandala browser (Cheypesh et al. 2006) for XML structures as a means of exploring embedded relationships. The current Mandala prototype cannot accommodate the amount of data and number of tags in Orlando, so we will present the results of experimenting with a subset of the hyperlinking tags as another means of visualizing the dense network of associations in Orlando’s representations of literary history. References Barabási, Albert-Lászlo. Linked:The New Science of Networks. Cambridge, MA: Perseus Publishing, 2002. Brown, Susan, Patricia Clements, and Isobel Grundy, ed. Orlando:Women’s Writing in the British Isles from the Beginnings to the Present. Cambridge: Cambridge Online, 2006. Cheypesh, Oksana, Constanza Pacher, Sandra Gabriele, Stéfan Sinclair, Drew Paulin and Stan Ruecker. “Centering the mind and calming the heart: mandalas as interfaces.” Paper presented at the Society for Digital Humanities (SDH/SEMI) conference.York University, Toronto. May 29-31, 2006. Mandala Rich Prospect Browser. Dir. Stéfan Sinclair and Stan Ruecker. http://mandala.humviz.org Accessed 22 November 2007. Moretti, Franco. Graphs, Maps,Trees: Abstract Models for a Literary Theory. London:Verso, 2005. _____________________________________________________________________________ 71 Digital Humanities 2008 _____________________________________________________________________________ The impact of digital interfaces on virtual gender images Sandra Buchmüller sandra.buchmueller@telekom.de Deutsche Telekom Laboratories, Germany Gesche Joost gesche.joost@telekom.de Deutsche Telekom Laboratories, Germany Rosan Chow rosan.chow@telekom.dt Deutsche Telekom Laboratories, Germany This paper documents an exploratory research in progress, investigating the relationship between the quality of digital interfaces, their technical conditions and the interfacial mediation of the users’ body. Here, we focus on the bodily dimension of gender. For this reason, we analysed two online role playing games with different representation modes (text-based versus image-based) asking which impact the digital interface and their technical conditions have on gender performances. Following sociological investigations (Bahl, 1998/ Goffman, 2001/ Lübke, 2005/ Müller, 1996), the bodily aspect of gender plays a central role in communication settings due to its social, cultural meaning which nowadays strongly is mediated by information technology. For this reason, we focus on the interfaces. We claim that their representation mode, their design and software constraints have a crucial impact on the virtual patterns of gender referring to individual performance, spatial movement and communication. This interfacial analysis is just a part of an interdisciplinary inquiry about the interrelation between gender, design and ICT. It is allocated in the overlapping field of sociology, gender studies, design research and information technology. In this respect, we regard the body and its gender as culturally constructed interface of social interactions and advocate for reflecting it within the process software development and interface design. Introduction Today’s communication is highly influenced by information technology, which substitutes more and more physical body representations in face-to-face communication. 
Disembodied experiences have become a part of ordinary life as self performances and interactions are often mediated by designed hardware and software interfaces. In this context, designers who make ICT accessible via their product and screen designs can be regarded as mediators between technological requirements and user needs. They usually regard interfaces from the point of view of formal-aesthetic (Apple Computer, Inc, 1992; McKey, 1999; Schneiderman/ Plaisant, 2005), or of usability (Krug, 2006; Nielsen/ Loranger, 2006) and interaction-design (Cooper/ Reimann, 2003; Preece/ Rogers/ Sharp, 2002). But designers not only make ICT usable, but also culturally significant. In this respect, they also deal with the cultural constructions and implications of gender which have to be reflected by their screen designs. The interfacial conditions decide about the bodily representations in ICT interaction. By referring to interactionist (Goffman, 2001), constructivist (Butler, 2006/ Teubner, Wetterer, 1999/ Trettin, 1997/ West, Zimmermann, 1991, 1995) theories and sociological investigations of virtual environments (Eisenrieder, 2003/ Lübke, 2005/ Turkle, 1999), we claim that the bodily aspect of gender is an essential reference point of individual performance, communication, even spatial movements not as a physical property of the body, but due to its cultural meaning and connotative potential. It is supposed to be the most guiding information referring to interaction contexts (Goffman, 2001/ Lübke, 2005). It has a crucial impact on the behavior: Being polite, e.g. opening the door for someone is often a decision made in dependence of the counterpart’s gender (Müller, 1996). Not knowing about it causes behavioral confusion (Bahl, 1998/ Lübke, 2005). Research Perspectives & Research Questions In contrast to a sociological point of view, a conventional design perspective or technological requirements, we add a completely new aspect to design and technologically driven inquiries in two respects: - By taking the body as a benchmark for analyzing gender representations of digital interfaces. - By investigating the virtual body and its gender representations from an interfacial point of view. Our investigations and reflections are guided by the following questions: - Which gender images do exist in virtual spaces? - Which impact do the digital interface and their technical conditions have on gender performances? Furthermore, we are particularly interested in knowing about the impact of design on preserving traditional and stereotype gender images as well as on modifying or deconstructing them. _____________________________________________________________________________ 72 Digital Humanities 2008 _____________________________________________________________________________ Objects of Investigation Methodology Objects of investigations are two online role playing games, so called Multi User Dungeons (MUDs). They are especially suitable for this purpose because they directly refer to bodily representations in form of virtual characters. We choose two MUDs with different representation modes in order to compare the effects of opposite interfaces on the virtual embodiment of gender: LambdaMOO, a popular text-based online role playing game (see Image 1 LM), is contrasted with Second Life, the currently most popular and populated graphical MUD (see Image 1 SL). Examining their interfaces promise to get concrete insights into the interfacial influence on virtual gender images. 
We use the methods of content analysis and participatory observation – the first in order to explore the interfacial offer of options and tools to create virtual gender representations and the second in order to know, how it feels developing and wearing the respective gendered skin. The analysis and observations are guided and structured using the different dimensions of the body as a matrix of investigating which are empirically generated from the observations themselves. Within these bodily categories, different aspects of gender are identified: Body presence - modes of existence or being there Personality / individuality - forms of personal performance - forms of non-verbal communication (facial expressions, gestures, vocal intonations and accentuation) - modes of emotional expressions Patterns of gender - categories or models of gender Actions and spatial Movements - modes of behavior and actions - modes of spatial orientation and navigation Patterns of social behavior - modes of communication and interaction Image 1 LM: Interface - community standards, behavioral guidelines and rules Research Results The main findings show, how interfaces and their specific technical conditions can expand or restrict the performance of gender. Moreover, they demonstrate how the conventional bipolar gender model of western culture can be re- and deconstructed by the interfacial properties and capacities. Image 1 SL: Interface The table below gives a summarizing overview about how the different bodily aspects are treated by the respective interface of LambdaMOO or Second Life. Interfacial elements which explicitly address gender aspects are indicated in bold italics. Reference to images are marked “see Image # LM (LambdaMOO)/SL(Second Life)”. _____________________________________________________________________________ 73 Digital Humanities 2008 _____________________________________________________________________________ Body aspect LambdaMOO (text-based) Second Life (image-based) Body presence (addressing more or less explicitly cultural associations of gender) Nickname: Renaming is possible at any time (See Image 2.1 LM and Image 2.2 LM) Nickname: name is predetermined by the interface (surname is selectable from a list, the availability of the combination of fore and sure name has to be checked by the system); name can‘t be changed after being registered (See Image 2.1 SL and Image 2.2 SL) Avatar: after registration, a character has to be selected of a default set of 6 male and female avatars (See Image 3 SL) which can be individually modified by the ‘Appearance Editor’ Personality/ individuality Actions and spatial movements Patterns of gender Self-descriptions command - Appearance Editor: just offers the gender categories ‘male and female’, but allows to create transgender images (See Image 7.1 SL and Image 7.2 SL) - Profile Window Emote-command Set of gestures: differentiated into common and male/ female gestures corresponding to the avatar’s gender (See Image 9.1 SL, Image 9.2 SL, Image 10.1 SL and Image 10.2 SL) Emote-command Set of gestures: differentiated into common and male/ female gestures corresponding to the avatar’s gender - Moving-around commands: (cardinal points + up/down) - Look-around commands - Navigation tool: the avatar’s movements are programmed corresponding to gender stereotypes: female avatars walk with swinging hips while male avatars walk bowlegged (See Image 8.1 SL) - Camera-view tool - Sit down / Stand up command: the avatar’s movements are programmed corresponding to 
gender stereotypes: female avatars sit close-legged, while male avatars sit bowlegged (See Image 8.2 SL) - Teleporting - Examine-object commands - Take-object command 10 gender categories: Neuter, male, female, either, Spivak (“To adopt the Spivak gender means to abjure the gendering of the body ,to refuse to be cast as male, female or transsexual.”Thomas, 2003)), splat, plural, egoistical, royal, 2nd (See Image 4.1 LM and Image 4.2 LM) Male, female: Changing the gender category is possible at any time by the ‘Appearance Editor’ (See Image 6 SL); naked male avatars are sexless(See Image 5.1 SL); female avatars have breast but also no genital; masculinity can be emphasized by editing the genital area of the avatar‘s trousers (See Image 5.2 SL and Image 5.3) femininity by the size of breast and their position to each other (together or more distant) Image 2.1 LM: Rename command Image 2.1 SL: Choose surname Image 2.2 LM: Nickname Magenta_Guest Set of male / female gestures (See Image 10.1 SL and Image 10.2 SL): Female gesture set with 18 versus male gesture set with 12 categories, may support the stereotype of females being more emotionally expressive Virtual patterns of social behavior Chat-command: say Chat-commands: speak / shout Some gestures are accompanied by vocal expressions which are differentiated into male and female voices Long-distancecommunication command Instant Messenger Behavioral guidelines Community Standards: ‘Big Six’& Policy includes the topic ‘harassment’ which mainly affects females personas Image 2.2 SL: Name check Abuse Reporter Tool Reporting violations directly to the creators of Second Life ‘Linden Lab’ _____________________________________________________________________________ 74 Digital Humanities 2008 _____________________________________________________________________________ Image 5.3 SL: Edit masculinity - big Image 3 SL: First set of avatars Image 4.1 LM: Gender categories Image 4.2 LM: Re-gender Image 6 SL: Appearance editor Image 5.1 SL: Male avatar without genitals Image 7.1 SL:Transgender Image 5.2 SL: Edit masculinity - small _____________________________________________________________________________ 75 Digital Humanities 2008 _____________________________________________________________________________ Image 9.2 SL: Common gestures - male avatar Image 7.2 SL: Gender swap from female to male Image 10.1 SL: Female gestures - female avatar Image 8.1 SL:Walking avatars Image 10.2 SL: Male gestures - male avatar Image 8.2 SL: Sitting bowlegged/close legged The comparison demonstrates that both interfaces use nearly the same patterns of embodiment. Referring to the bodily aspect of gender, Second Life is a poor and conservative space. Gender modifications besides the common model seem not to be intended by the interface: In case of gender switches the gender specific gestures set does not shift correspondently (see images 11.1 SL and11.2 SL). Image 9.1 SL: Common gestures - female avatar _____________________________________________________________________________ 76 Digital Humanities 2008 _____________________________________________________________________________ Image 11.1 SL: Female gestures - male avatar Bahl, Anke (1996): Spielraum für Rollentäuscher. Muds: Rollenspielen im Internet: http://unitopia.uni-stuttgart.de/ texte/ct.html (also in: c’t, 1996, Heft 8) Becker, Barbara (2000): Elektronische Kommunikationsmedien als neue „Technologien des Selbst“? Überlegungen zur Inszenierung virtueller Identitäten in elektronischen Kommunikationsmedien. 
In Huber, Eva (Hrsg.): Technologien des Selbst. Zur Konstruktion des Subjekts, Basel/ Frankfurt a. M., P. 17 – 30 Bratteteig, Tone (2002): Bridging gender issues to technology design. In Floyd, C. et al.: Feminist Challenges in the Information Age. Opladen: Leske + Budrich, p. 91-106. Image 11.2 SL: Male gestures - female avatar In case of a male-to-female modification, the avatars clothes do not fit any more (see image 12 SL). Butler, J. (2006): Das Unbehagen der Geschlechter, 1991, Suhrkamp Frankfurt a. M. Clegg, S. and Mayfield, W. (1999): Gendered by design. In Design Issues 15 (3), P. 3-16 Cooper, Alan and Reimann, Robert (2003): About Face 2.0.The Essentials of Interaction Design, Indianapolis Eisenrieder,Veronika (2003): Von Enten,Vampiren und Marsmenschen - Von Männlein,Weiblein und dem “Anderen”. Soziologische Annäherung an Identität, Geschlecht und Körper in den Weiten des Cyberspace, München Goffman, Erving (2001); Interaktion und Geschlecht, 2. Auflage, Frankfurt a. M. Image 12 SL: Female to male swap - unfitting clothes In contrast to that, LambdaMOO questions the analogue gender model by violating and expanding it up to 10 categories. Outlook This inquiry belongs to a design research project which generally deals with the relation between gender and design. It aims at the development of a gender sensitive design approach investigating the potential of gender modification and deconstruction by design. References Apple Computer, Inc. (1992): Macintosh Human Interface Guidelines (Apple Technical Library), Cupertino Bahl, Anke (1998): MUD & MUSH. Identität und Geschlecht im Internet Ergebnisse einer qualitativen Studie. In Beinzger, Dagmar/ Eder, Sabine/Luca, Renate/ Röllecke, Renate (Hrsg.): Im Weiberspace - Mädchen und Frauen in der Medienlandschaft, Schriften zur Medienpädagogik Nr.26, Gesellschaft für Medienpädagogik und Kommunikationskultur, Bielefeld, p. 138 – 151 Kleinen, Barbara (1997): Körper und Internet - Was sich in einem MUD über Grenzen lernen lässt: http://tal.cs.tu-berlin. de/~finut/mud.html (or in Bath, Corinna and Kleinen, Barbara (eds): Frauen in der Informationsgesellschaft: Fliegen oder Spinnen im Netz? Mössingen-Talheim, p. 42-52. Krug, Steve (2006): Don’t make me think! Web Usability. Das intuitive Web, Heidelberg Lin,Yuwei (2005): Inclusion, diversity and gender equality: Gender Dimensions of the Free/Libre Open Source Software Development. Online: opensource.mit.edu/papers/lin3_ gender.pdf (9.5.2007) Lübke,Valeska (2005): CyberGender. Geschlecht und Körper im Internet, Königstein McKey, Everett N. (1999): Developing User Interfaces for Microsoft Windows, Washington Moss, Gloria, Gunn, Ron and Kubacki, Krysztof (2006): Successes and failures of the mirroring principle: the case of angling and beauty websites. In International Journal of Consumer Studies,Vol. 31 (3), p 248-257. _____________________________________________________________________________ 77 Digital Humanities 2008 _____________________________________________________________________________ Müller, Jörg (1996):Virtuelle Körper. Aspekte sozialer Körperlichkeit im Cyberspace unter: http://duplox.wz-berlin. de/texte/koerper/#toc4 (oder WZB Discussion Paper FS II 96-105, Wissenschaftszentrum Berlin Towards a model for dynamic text editions Nielsen, Jakob and Loranger, Hoa (2006): Web Usability, München buzzetti@philo.unibo.it University of Bologna, Italy Oudshoorn, N., Rommes, E. and Stienstra, M. 
(2004): Configuring the user as everybody: Gender and design cultures in information and communication technologies. In Science, Technology and Human Values 29 (1), p. 30-63 Preece, Jennifer, Rogers, Yvonne and Sharp, Helen (2002): Interaction Design. Beyond human-computer interaction, New York Rex, Felis alias Richards, Rob: http://www.LambdaMOO.info/ Rommes, E. (2000): Gendered User Representations. In Balka, E. and Smith, R. (eds): Women, Work and Computerization. Charting a Course to the Future. Dordrecht, Boston: Kluwer Academic Pub, p. 137-145. Second Life Blog: April 2007 Key Metrics Released http://blog.secondlife.com/2007/05/10/april-2007-key-metrics-released/, http://s3.amazonaws.com/static-secondlife-com/economy/stats_200705.xls Teubner, U. and A. Wetterer (1999): Soziale Konstruktion transparent gemacht. In Lorber, J. (ed.): Gender Paradoxien, Leske & Budrich: Opladen, p. 9-29. Thomas, Sue (2003): www.barcelonareview.com/35/e_st.htm, issue 35, March - April Trettin, K. (1997): Probleme des Geschlechterkonstruktivismus. In: G. Völger (ed.): Sie und Er, Rautenstrauch-Joest-Museum, Köln Turkle, Sherry (1986): Die Wunschmaschine. Der Computer als zweites Ich, Reinbek bei Hamburg Turkle, Sherry (1999): Leben im Netz. Identität in Zeiten des Internet, Reinbek bei Hamburg West, C., Zimmerman, D.H. (1991): Doing Gender. In Lorber, J. and Farell, S. A. (eds): The Social Construction of Gender, Newbury Park/London, p. 13-37 West, C., Zimmerman, D.H. (1995): Doing Difference. Gender & Society, 9, p. 8-37.

Dino Buzzetti
Malte Rehbein
malte.rehbein@nuigalway.ie
National University of Ireland, Galway, Ireland

The creation of digital editions has so far been devoted for the most part to the visualisation of the text. The move from mental to machine processing, as envisaged in the Semantic Web initiative, has not yet become a priority for editorial practice in a digital environment. This drawback seems to reside in the almost exclusive attention paid until now to markup at the expense of textual data models. The move from "the database as edition" [Thaller, 1991: 156-59] to the "edition as a database" [Buzzetti et al., 1992] seems to survive only in a few examples. As a way forward, we might expect digital editions to care more about processing textual information rather than just being satisfied with its visualisation. Here we shall concentrate on a recent case study [Rehbein, forthcoming], trying to focus on the kind of logical relationship that is established there between the markup and a database managing contextual and procedural information about the text. The relationship between the markup and a data model for textual information seems to constitute the key to the representation of textual mobility. From an analysis of this kind of relationship we shall tentatively try to elicit a dynamic model to represent textual phenomena such as variation and interpretation.

I.

The case study uses the digital edition of a manuscript containing legal texts from the late medieval town of Göttingen. The text shows that this law was anything but unchangeable. With it, the city council reacted continually to economic, political or social changes, thus adapting the law to a changing environment. The text is consequently characterised by its many revisions, made by the scribes either by changing existing text or by creating new versions of it. What has come down to us is, thus, a multi-layered text, reflecting the evolutionary development of the law.
In order to visualise and to process the text and its changes, not only the textual expression but also its context has to be considered and described: when was the law changed, what was the motivation for the change, and what were its consequences? Answers to these questions are in fact required in order to reconstruct the different layers of the text and thereby the evolution of the law. Looking at the text today, however, it is not always obvious how to date the alterations. Sometimes it is not even clear what their chronological order was.

A simple example illustrates the point. Consider the following sentence, taken from the Göttingen bylaws on beer brewing:

we ock vorschote 100 marck, de darf 3 warve bruwen

together with 150 as a replacement for 100 and 2 as a replacement for 3. (The sentence, in Middle Low German, means: one who pays 100 (150) marks in taxes is allowed to brew beer 3 (2) times a year.) Without additional information, the following four readings are possible, all representing different stages of the textual development:

R1: we ock vorschote 100 marck, de darf 3 warve bruwen
R2: we ock vorschote 100 marck, de darf 2 warve bruwen
R3: we ock vorschote 150 marck, de darf 3 warve bruwen
R4: we ock vorschote 150 marck, de darf 2 warve bruwen

With some more (mainly palaeographical) information, but still limited knowledge, three facts become clear: firstly, that R1 is the oldest version of the text; secondly, that R4 is the most recent; and thirdly, that either R2 or R3 existed as a text layer, or neither of them, but not both. What, then, was the development of this sentence? Was it the direct path from R1 to R4? Or do we have to consider R1 > R2 > R4 or R1 > R3 > R4? In order to answer these questions we need to know about the context of the text, something that cannot be found in the text itself. It is the external, procedural and contextual knowledge that has to be linked to the textual expression in order to fully analyse and edit the text. Textual mobility in this example means that, to a certain extent, the textual expression itself, its sequence of graphemes, can be regarded as invariant and objective, while the external knowledge about its context cannot.

It is essential in our case study not only to distinguish between the expression and the context of the text but, what is more, to allow flexibility in the definition and reading of (possible) text layers. It soon became clear that, for both visualising and processing a dynamic text, a new understanding of an edition is needed and that, as a consequence, the mark-up strategy has to be reconsidered. This new understanding would "promote" the reader of an edition to its user, by making him part of it in such a way that his external knowledge, his contextual setting, would influence the representation of the text. Or, in other words: dynamic text requires dynamic representation. The way chosen in this study is to treat textual expression and context (external knowledge) separately. The expression is represented by mark-up, encoding the information about the text itself. Taken on its own, this allows the different units of the text (its oldest version, its later alterations or annotations) to be visualised, but not to be brought into a meaningful relationship to each other; that relationship depends on the contextual knowledge, as the hypothetical sketch below illustrates.
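The following is a minimal, hypothetical sketch of such a linkage; it is not the project's actual implementation, which the abstract does not specify at this level of detail. The expression side records which readings were altered, the "database" side records contextual knowledge (here, invented dates and motivations), and a text layer is reconstructed by combining the two:

```python
# Hypothetical sketch only: markup identifiers for alterations in the expression
# are linked to contextual records, so a layer can be reconstructed once dated.

BASE_READING = {"amount": "100", "times": "3"}   # oldest layer, R1
SENTENCE = "we ock vorschote {amount} marck, de darf {times} warve bruwen"

# Markup side: alterations identified in the transcription.
ALTERATIONS = {
    "alt-amount": {"slot": "amount", "new": "150"},
    "alt-times":  {"slot": "times",  "new": "2"},
}

# Database side: external, contextual knowledge keyed to the markup identifiers.
# The dates are invented for illustration; establishing them is precisely the
# scholarly problem described above.
CONTEXT = {
    "alt-amount": {"date": 1410, "motivation": "tax reform"},
    "alt-times":  {"date": 1425, "motivation": "grain shortage"},
}

def reading_in(year: int) -> str:
    """Reconstruct the sentence as it stood in a given year."""
    values = dict(BASE_READING)
    for alt_id, alteration in ALTERATIONS.items():
        if CONTEXT[alt_id]["date"] <= year:        # alteration already made
            values[alteration["slot"]] = alteration["new"]
    return SENTENCE.format(**values)

for year in (1400, 1415, 1430):
    print(year, reading_in(year))   # yields R1, R3 and R4 for these sample dates
```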
The latter is realised by a database providing structured external information about the text, mainly what specific “role” a certain part of the text “plays” in the context of interest. Only managing and processing both, markup and database, will allow to reconstruct the different stages of the text and consequently to represent the town law in its evolutionary development. Using the linkage mechanism between mark-up and database, the whole set of information is processable. In order to create a scholarly edition of the text, we can automatically produce a document that fulfils TEI conformity to allow the use of the widely available tools for transformation, further processing and possibly interchange. II. The case study just examined shows that in order to render an edition processable we have to relate the management system of the relevant data model to the markup embedded in the text. But we cannot provide a complete declarative model of the mapping of syntactic markup structures onto semantic content structures. The markup cannot contain a complete content model, just as a content model cannot contain a complete and totally definite expression of the text. To prove this fact we have to show that a markup description is equivalent to a second-order object language self-reflexive description and recall that a second-order logical theory cannot be complete. So the mapping cannot be complete, but for the same reason it can be categorical; in other words, all the models of the text could be isomorphic. So we can look for general laws, but they can provide only a dynamic procedural model. Let us briefly outline the steps that lead to this result. In a significant contribution to the understanding of “the meaning of the markup in a document,” [Sperberg-McQueen, Huitfeldt, and Renear, 2000: 231] expound it as “being constituted,” and “not only described,” by “the set of inferences about the document which are licensed by the markup.” This view has inspired the BECHAMEL Markup Semantics Project, a ground breaking attempt to specify mechanisms “for bridging [...] syntactic relationships [...] with the distinctive semantic relationships that they represent” [Dubin and Birnbaum, 2004], and to investigate in a systematic way the “mapping [of] syntactic markup structures [on]to instances of objects, properties, and relations” [Dubin, 2003] that could be processed trough an appropriate data model. Following [Dubin and Birnbaum, 2004], “that markup can communicate the same meaning in different ways using very different syntax”, we must conclude that “there are many ways of expressing the same content, just as there are many ways of assigning a content to the same expression” [Buzzetti, forthcoming]. The relationship between expression and content is then an open undetermined relationship that can be formalized by taking into account the “performative mood” of markup [Renear, 2001: 419]. For, a markup element, or any textual mark for that matter, is ambivalent: it can be seen as part of the _____________________________________________________________________________ 79 Digital Humanities 2008 _____________________________________________________________________________ text, or as a metalinguistic description/ indication of a certain textual feature. Linguistically, markup behaves as punctuation, or as any other diacritical mark, i.e. as the expression of a reflexive metalinguistic feature of the text. 
Formally, markup behaves just as Spencer-Brown’s symbols do in his formal calculus of indications [1969]: a symbol in that calculus can express both an operation and its value [Varela, 1979: 110-111]. Markup adds structure to the text, but it is ambivalent. It can be seen as the result of a restructuring operation on the expression of the text (as a textual variant) or as an instruction to a restructuring operation on the content of the text (as an interpretational variant). By way of its ambivalence it can act as a conversion mechanism between textual and interpretational variants [Buzzetti and McGann, 2006: 66] [Buzzetti, forthcoming]. Markup originates a loop connecting the structure of the text’s expression to the structure of the text’s content. An implementation of the markup loop would considerably enhance the functionality of text representation and processing in a digital edition. To achieve implementation, markup information could be integrated into the object (or datatype) ‘string’ on which an application system operates. Extended strings, as a datatype introduced by Manfred Thaller [1996, 2006], look as a suitable candidate for the implementation of the markup loop.

Bibliography

[Buzzetti, 2004] Buzzetti, Dino. ‘Diacritical Ambiguity and Markup,’ in D. Buzzetti, G. Pancaldi, and H. Short (eds.), Augmenting Comprehension: Digital Tools and the History of Ideas, London-Oxford, Office for Humanities Communication, 2004, pp. 175-188: URL = <http://137.204.176.111/dbuzzetti/pubblicazioni/ambiguity.pdf> [Buzzetti, 1992] Buzzetti, Dino, Paolo Pari e Andrea Tabarroni. ‘Libri e maestri a Bologna nel xiv secolo: Un’edizione come database,’ Schede umanistiche, n.s., 6:2 (1992), 163-169. [Buzzetti, 2002] Buzzetti, Dino. ‘Digital Representation and the Text Model,’ New Literary History, 33:1 (2002), 61-87. [Buzzetti and McGann, 2006] Buzzetti, Dino, and Jerome McGann. ‘Critical Editing in a Digital Horizon,’ in Electronic Textual Editing, ed. Lou Burnard, Katherine O’Brien O’Keeffe, and John Unsworth, New York, The Modern Language Association of America, 2006, pp. 51-71. [Buzzetti, forthcoming] Buzzetti, Dino. ‘Digital Editions and Text Processing,’ in Text Editing in a Digital Environment, Proceedings of the AHRC ICT Methods Network Expert Seminar (London, King’s College, 24 March 2006), ed. Marilyn Deegan and Kathryn Sutherland (Digital Research in the Arts and Humanities Series), Aldershot, Ashgate, forthcoming. [Dubin, 2003] Dubin, David. ‘Object mapping for markup semantics,’ Proceedings of Extreme Markup Languages 2003, Montréal, Québec, 2003: URL = <http://www.idealliance.org/papers/extreme/proceedings//html/2003/Dubin01/EML2003Dubin01.html> [Dubin and Birnbaum, 2004] Dubin, David, and David J. Birnbaum. ‘Interpretation Beyond Markup,’ Proceedings of Extreme Markup Languages 2004, Montréal, Québec, 2004: URL = <http://www.idealliance.org/papers/extreme/proceedings//html/2004/Dubin01/EML2004Dubin01.html> [McGann, 1991] McGann, Jerome.
The Textual Condition, Princeton, NJ, Princeton University Press, 1991. [McGann, 1999] McGann, Jerome. ‘What Is Text? Position statement,’ ACH-ALLC’99 Conference Proceedings, Charlottesville,VA, University of Virginia, 1999: URL = <http://www.iath.virginia.edu/ach-allc.99/proceedings/hockeyrenear2.html> [Rehbein, forthcoming] Rehbein, Malte. Reconstructing the textual evolution of a medieval manuscript. [Rehbein, unpublished] Rehbein, Malte. Göttinger Burspraken im 15. Jahrhundert. Entstehung – Entwicklung – Edition. PhD thesis, Univ. Göttingen. _____________________________________________________________________________ 80 Digital Humanities 2008 _____________________________________________________________________________ [Renear, 2001] Renear, Allen. ‘The descriptive/procedural distinction is flawed,’ Markup Languages:Theory and Practice, 2:4 (2001), 411–420. [Sperberg-McQueen, Huitfeldt and Renear, 2000] SperbergMcQueen, C. M., Claus Huitfeldt, and Allen H. Renear. ‘Meaning and Interpretation of Markup,’ Markup Languages: Theory and Practice, 2:3 (2000), 215-234. [Thaller,1991 ] Thaller, Manfred. ‘The Historical Workstation Project,’ Computers and the Humanities, 25 (1991), 149-62. [Thaller, 1996] Thaller, Manfred. ‘Text as a Data Type,’ ALLCACH’96: Conference Abstracts, Bergen, University of Bergen, 1996. [Thaller, 2006] Thaller, Manfred. ‘Strings, Texts and Meaning.’ Digital Humanities 2006: Conference Abstracts, Paris, CATI - Université Paris-Sorbonne, 2006, 212-214. Reflecting on a Dual Publication: Henry III Fine Rolls Print and Web Arianna Ciula arianna.ciula@kcl.ac.uk King’s College London, UK Tamara Lopez tamara.lopez@kcl.ac.uk King’s College London, UK A collaboration between the National Archives in the UK, the History and Centre for Computing in the Humanities departments at King’s College London, the Henry III Fine Rolls project (http://www.frh3.org.uk) has produced both a digital and a print edition (the latter in collaboration with publisher Boydell & Brewer) [1] of the primary sources known as the Fine Rolls. This dual undertaking has raised questions about the different presentational formats of the two resources and presented challenges for the historians and digital humanities researchers involved in the project, and, to a certain extent, for the publisher too. This paper will examine how the two resources evolved: the areas in which common presentational choices served both media, and areas in which different presentational choices and production methodologies were necessary. In so doing, this paper aims to build a solid foundation for further research into the reading practices and integrated usage of hybrid scholarly editions like the Henry III Fine Rolls. Presentation as interpretation In Material Culture studies and, in particular, in studies of the book, the presentational format of text is considered to be of fundamental importance for the study of production, social reading and use. Therefore, description of and speculation about the physical organisation of the text is essential of understanding the meaning of the artefact that bears that text. Similarly, in Human Computer Interaction studies and in the Digital Humanities, the presentation of a text is considered to be an integral outgrowth of the data modelling process; a representation of the text but also to some degree an actualisation of the interpretative statements about the text. 
Indeed, to the eyes of the reader, the presentational features of both a printed book and a digital written object will not only reveal the assumptions and beliefs of its creators, but affect future analysis of the work. Dual publication: digital and print On the practical side, within the Henry III Fine Rolls project, different solutions of formatting for the two media have been negotiated and implemented. _____________________________________________________________________________ 81 Digital Humanities 2008 _____________________________________________________________________________ The print edition mainly represents a careful pruning of the digital material, especially as pertains to the complex structure of the indexes. The indexes were built upon the marked-up fine rolls texts and generated from an ontological framework (Ciula, Spence, Vieira, & Poupeau; 2007). The latter, developed through careful analysis by scholars and digital humanities researchers, constitutes a sort of an a posteriori system that represents familial networks, professional relationships, geo-political structures, thematic clusters of subjects, and in general various types of associations between the 13th century documents (the so called Fine Rolls) and the the roles played by places and people in connection with them. The ontology is used to produce a series of pre-coordinate axes (the indices) that the reader can follow to explore the texts.The flexibility of the ontology allows the texts to be fairly exhaustively indexed, just as the presentational capabilities of the digital medium allow for the display and navigation of indexes that are correspondingly large. By contrast, the print edition had to follow the refined conventions of a well established scholarly tradition in publishing editions in general and calendar [2] editions in particular, both in terms of formatting and, more importantly for us, in terms of content selection/creation and modelling. Though the indices within the printed edition are also precoordinate axes along which to explore the text, the way in which they are produced is perceived to be a nuanced and intuitive aspect of the scholarship and one that revealed itself to be less tolerant to change.This, coupled with the presentational constraints of the printed medium result in indices that present information succinctly and with a minimum of conceptual repetition. Similarly, the first print volume of around 560 pages gives absolute prominence -something that can be stated much more strongly in a linear publication than in a digital one- to a long and detailed historical introduction, followed by a section on the adopted editorial strategies. However, the two artefacts of the project also share many points in common, either because the digital medium had to mirror the tradition of its more authoritative predecessor, or for more practical -nevertheless not to be dismissed- reasons of work-flow and foreseen usage. An interesting example of the latter is the adopted layout of footnotes, where the print format was modelled on the base of the digital layout and, although it was a completely unusual arrangement, was accepted as suitable by the publisher. On the base of the work done so far and on the feedback on the use of the book and the website, the presentational format will be refined further for future print volumes to come and for the additional material to be included in the digital edition before the end of the project. 
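As a rough, purely illustrative picture of the pre-coordinate indexes described above, one can think of them as groupings over person-role-document associations. The records below are invented, and a plain dictionary stands in for the project's much richer ontological framework:

```python
# Toy sketch of an ontology-driven, pre-coordinate index (invented records).
from collections import defaultdict

associations = [  # (person, role, document reference) -- illustrative only
    ("Ralph de Neville", "payer of fine", "no. 12"),
    ("Ralph de Neville", "payer of fine", "no. 87"),
    ("Ralph de Neville", "addressee", "no. 203"),
    ("Emma of Lincoln", "payer of fine", "no. 44"),
]

index = defaultdict(lambda: defaultdict(list))
for person, role, ref in associations:
    index[person][role].append(ref)

# Print index entries a reader could follow into the texts.
for person in sorted(index):
    print(person)
    for role, refs in sorted(index[person].items()):
        print(f"  {role}: {', '.join(refs)}")
```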
One reading process On the methodological side, we believe that further research into the usage and reading process of these parallel publications could lead towards a better understanding of scholarly needs and therefore a better modelling of such a dual product that is becoming a more and more common deliverable in digital humanities projects. As this paper will exemplify, the presentation of data needs to be tailored to take into account the more or less fine conventions of two different media which have different traditions, different life cycles, different patterns of use and, possibly, different users. However, although very different in nature, these two publications are not necessarily perceived and – more importantly- used as separate resources with rigid boundaries between them. For a scholar interested in the fine rolls, the reading of the edition and the seeking of information related to it (persons, places, subjects and any other interesting clue to its historical study in a broader sense) is a global process that does not stop when the book is closed or the the browser shut. We believe that, when supported by a deep interest in the material, the connection between the two publications is created in a rather fluid manner. The reality of the reading process and information seeking, as it happens, is influenced by the products it investigates, but ultimately has a form of its own that is different from the objects of analysis. It is dynamic and heterogeneous, it leaves on the integration between different types of evidence, no matter what their format is, including other kind of external sources. Indeed, the library or archive is the most likely environment where a scholar of the fine rolls would find herself browsing the print or digital edition, eventually the original primary sources or their digital images, plus any range of secondary sources. Studying the integration of print and digital The data behind the two publications are drawn from the same informational substrate, but are separated to create two presentational artefacts. As established, reading is expected to be the primary activity performed using both and a stated design goal for the project is that the two artefacts will form a rich body of materials with which to conduct historical research.The heterogeneity of the materials, however, suggests that working with texts will of necessity also involve periods of information seeking: moments while reading that give rise to questions which the material at hand cannot answer and the subsequent process embarked upon in order to answer them. Our working hypothesis is that to fill these information gaps (Wilson, 1999), the reader will turn to particular texts in the alternative medium to find answers, moving between the website and the books, fluctuating between states of reading and seeking. _____________________________________________________________________________ 82 Digital Humanities 2008 _____________________________________________________________________________ Thus, the analytical stream in this paper will move from the practices of creating two types of resources to establishing an analytical framework for evaluating their use. Situating the project materials and domain experts within the literature of information behaviour research, we will identify and develop a model for evaluating how well the features of the website and the book support information seeking activities that bridge (Wilson, 1999) reading within the individual media. 
Conclusions Based on our experience in creating a hybrid edition for the Henry III Fine Rolls project,the challenges and adopted solutions for the two types of published resources are a starting point from which to reflect on the integrated production of a dual object. At the same time, continuing work begun elsewhere in the digital humanities (Buchanan, Cunningham, Blandford, Rimmer, & Warwick; 2006) to adapt methodologies used in Information Science and Book Studies, a rationale and method for the design of an analysis of their use and, in particular, of the interaction between scholars and the website/books can be outlined. Dryburgh, Paul and Beth Hartland eds. Arianna Ciula and José Miguel Vieira tech. Eds. (2007) Calendar of the Fine Rolls of the Reign of Henry III [1216-1248], vol. I: 1216-1224, Woodbridge: Boydell & Brewer. Lavagnino, John (2007). Being Digital, or Analogue, or Neither. Paper presented at Digital Humanities 2007, UrbanaChampaign, 4-8 June, 2007. <http://www.digitalhumanities. org/dh2007/abstracts/xhtml.xq?id=219> McGann, Jerome. (1997) “The Rational of Hypertext.” In Kathryn Sutherland ed. Electronic text: investigations in method and theory. Clarendon Press: Oxford: 19-46. Siemens, Ray, Elaine Toms, Stéfan Sinclair, Geoffrey Rockwell, and Lynne Siemens. “The Humanities Scholar in the Twentyfirst Century: How Research is Done and What Support is Needed.” ALLC/ACH 2004 Conference Abstracts. Göteborg: Göteborg University, 2004. <http://www.hum.gu.se/ allcach2004/AP/html/prop139.html> Wilson, T. (1999) “Models in information behaviour research.” Journal of Documentation 55(3), 249-270. Notes [1] The first volume was published in September 2007 (Dryburgh et al. 2007). [2] Calendar stays here for an an English summary of the Latin records. Bibliography Buzetti, Dino and Jerome McGann (2005) “Critical Editing in a Digital Horizon”. In Burnard, O’Keeffe, and Unsworth eds. Electronic Textual Editing. <http://www.tei-c.org/Activities/ETE/Preview/mcgann.xml>. Buchanan, G.; Cunningham, S.; Blandford, A.; Rimmer, J. & Warwick, C. (2005), ‘Information seeking by humanities scholars’, Research and Advanced Technology for Digital Libraries, 9th European Conference, ECDL, 18--23. Ciula, A. Spence, P.,Vieira, J.M., Poupeau, G. (2007) Expressing Complex Associations in Medieval Historical Documents: The Henry III Fine Rolls Project. Paper presented at Digital Humanities 2007, Urbana-Champaign, 4-8 June, 2007. <http://www.digitalhumanities.org/dh2007/abstracts/xhtml. xq?id=196> Dervin, B. (1983) “An overview of sense-making research: Concepts, methods, and results to date”, International Communication Association Annual Meeting, Dallas,TX, 1983. Drucker, Johanna (2003). “The Virtual Codex from Page Space to E-space.” The Book Arts Web <http://www.philobiblon. com/drucker/> _____________________________________________________________________________ 83 Digital Humanities 2008 _____________________________________________________________________________ Performance as digital text: capturing signals and secret messages in a media-rich experience Jama S. Coartney jcoartney@virginia.edu University of Virginia Library, USA Susan L. Wiesner slywiesner@virginia.edu University of Virginia Library, USA Argument As libraries increasingly undertake digitisation projects, it behooves us to consider the collection/capture, organisation, preservation, and dissemination of all forms of documentation. 
By implication, then, these forms of documentation go beyond written text, long considered the staple of library collections. While several libraries have funded projects which acknowledge the need to digitise other forms of text, in graphic and audio formats, very few have extended the digital projects to include film, much less performed texts. As more performing arts incorporate born-digital elements, use digital tools to create media-rich performance experiences, and look to the possibility for digital preservation of the performance text, the collection, organisation, preservation, and dissemination of the performance event and its artefacts must be considered. The ARTeFACT project, underway at the Digital Media Lab at the University of Virginia Library, strives to provide a basis for the modeling of a collection of performance texts.As the collected texts document the creative process both prior to and during the performance experience, and, further, as an integral component of the performance text includes the streaming of data signals to generate audio/visual media elements, this paper problematises the capture and preservation of those data signals as artefacts contained in the collection of the media-rich performance event. Premise In a report developed by a working group at the New York Public Library, the following participant thoughts are included: Although digital technologies can incorporate filmic ways of perceiving [the performing arts], that is the tip of the iceberg. It is important for us to anticipate that there are other forms we can use for documentation rather than limiting ourselves to the tradition of a camera in front of the stage. Documentation within a digital environment far exceeds the filmic way of looking at a performance’ (Ashley 2005 NYPL Working Group 4, p.5) How can new technology both support the information we think is valuable, but also put it in a format that the next generation is going to understand and make use of?’ (Mitoma 2005 NYPL Working Group 4, p.6) These quotes, and many others, serve to point out current issues with the inclusion of the documentation of movement-based activities in the library repository. Two important library tasks, those of the organization and dissemination of text, require the development of standards for metadata. This requirement speaks towards the need for enabling content-based searching and dissemination of moving-image collections. However, the work being performed to provide metadata schemes of benefit to moving image collections most often refers to (a) a filmed dramatic event, e.g. a movie, and/or (b) metadata describing the film itself.Very little research has been completed in which the moving image goes beyond a cinematic film, much less is considered as one text within a multi-modal narrative. In an attempt to address these issues, the authors developed the ARTeFACT project in hopes of creating a proof-of-concept in the University of Virginia Library. Not content, however, to study the description of extant, filmic and written texts, the project authors chose to begin with describing during the process of the creation of the media-rich, digital collection, including the description of a live performance event. The decision to document a performance event begged another set of answers to questions of issues involved with the collection of texts in a multiplicity of media formats and the preservation of the artefacts created through a performance event. 
Adding to this layer of complexity was the additional decision to create not just a multi-media performance, but to create one in which a portion of the media was born-digital during the event itself. Created from signals transmitted from sensor devices (developed and worn by students), the born-digital elements attain a heightened significance in the description of the performance texts. After all, how does one capture the data stream for inclusion in the media-rich digital collection?

Methodology

The ARTeFACT project Alpha includes six teams of students in an Introductory Engineering Class. Each team was asked to design and build an orthotic device that, when worn, causes the wearer to emulate the challenges of walking with a physical disability (included are stroke ‘drop foot’ and paralysis, CP hypertonia, rickets, etc.). During the course of the semester, the student teams captured the pre-event process in a variety of digital formats: still and video images of the prototypes from cameras and cell phones, PDF files, CAD drawings, PowerPoint files and video documentation of those presentations. The digital files were collected in a local SAKAI implementation. In addition to these six teams, two other teams were assigned the task of developing wireless measurement devices which, when attached to each orthotic device, measure the impact of the orthotic device on the gait of the wearer. The sensor then transmits the measurement data to a computer that feeds the data into Cycling74’s software application Max/MSP/Jitter. Jitter, a program designed to take advantage of data streams, then creates real-time data visualization as output. The resultant audio-visual montage plays as the backdrop to a multi-media event that includes student ‘dancers’ performing while wearing the orthotic devices. The sensors generate the data signals from Bluetooth devices (each capable of generating up to eight independent signals) as well as an EMG (electromyogram) wireless system. At any given time there may be as many as seven signalling devices sending as many as 50 unique data streams into the computer for processing. As the sensor data from the performers arrives in Max/MSP/Jitter, it is routed to various audio and video instruments within the application, processed to generate the data visualization, then sent out of the computer via external monitor and sound ports. The screenshot below displays visual representations of both the input (top: spectrogram) and output (bottom: sonogram) of a real-time data stream. The data can be visualized in multiple ways; however, the data stream as we wish to capture it is not a static image, but rather a series of samples of data over time.
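To make the capture problem concrete, here is a minimal, hypothetical sketch of the first option discussed below: writing timestamped samples directly to disk as the performance progresses. The device names, channel counts, sampling rate and file layout are invented for illustration and are not the project's actual capture sub-system:

```python
# Hypothetical sketch: log incoming sensor samples as timestamped records so the
# stream itself, rather than a rendered image of it, becomes the preserved artefact.
import csv
import random  # stands in here for the Bluetooth/EMG inputs
import time

def capture(stream, path="performance_stream.csv", duration=5.0):
    """Append (timestamp, device, channel, value) rows for `duration` seconds."""
    start = time.time()
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        while time.time() - start < duration:
            device, channel, value = next(stream)
            writer.writerow([f"{time.time():.4f}", device, channel, value])
            time.sleep(0.01)  # stand-in sampling interval (~100 Hz)

def fake_stream():
    """Stand-in generator for the live data streams described above."""
    while True:
        yield ("emg-1", random.randrange(8), random.random())

capture(fake_stream(), duration=1.0)
```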
There are several options for capturing these signals, two of which are writing the data directly to disk as the performance progresses (as in the sketch above) and/or the use of external audio and video mixing boards that, in the course of capturing, can display the mix. Adding to the complexity, the performance draws on a wide variety of supporting, media-rich source material created during the course of the semester. A subset of this material is extrapolated for use in the final performance. These elements are combined, processed, morphed, and reformatted to fit the genre of the presentation and, although they may bear some similarity to the original material, they are not direct derivatives of the source and thus become unique elements in the production. Further, in addition to the capture of data streams, the totality of the performance event must be collected. For this, traditional means of capturing the performance event have been determined to be the simplest of the challenges faced. Therefore, a video camera and a microphone pointed at the stage will suffice to fill this minimum requirement for recording the event.

Conclusion

The complexity of capturing a performance in which many of the performance elements themselves are created in real time, processed, and used to generate audio-visual feedback is challenging. The inclusion of data elements in the artefact collection begs questions with regard to the means of capturing the data without impacting the performance. So, too, does it require that we question what data to include: does it make sense to capture the entire data stream or only the elements used at a specific instant in time to generate the performance? What are the implications of developing a sub-system within the main performance that captures this information? When creating a collection based on a performance as digital text, and before any work may be done to validate metadata schemes, we must answer these questions. We must consider how we capture the signals and interpret the secret messages generated as part of the media-rich experience.

Sample bibliography

Adshead-Lansdale, J. (ed.) 1999, Dancing Texts: intertextuality and interpretation. London: Dance Books.

Goellner, E. W. & Murphy, J. S. 1995, Bodies of the Text. New Brunswick, NJ: Rutgers University Press.

Kholief, M., Maly, K. & Shen, S. 2003, ‘Event-Based Retrieval from a Digital Library containing Medical Streams’ in Proceedings of the 2003 Joint Conference on Digital Libraries (Online) Available at http://ieeexplore.ieee.org/Xplore/login.jsp?url=/iel5/8569/27127/01204867.pdf

New York Public Library for the Performing Arts, Jerome Robbins Dance Division 2005, Report from Working Group 4 Dance Documentation Needs Analysis Meeting 2, NYPL: New York.

Reichert, L. 2007, ‘Intelligent Library System Goes Live as Satellite Data Streams in Real-Time’ (Online) Available at http://www.lockheedmartin.com/news/press_releases/2001/IntelligentLibrarySystemGoesLiveAsS.html

Function word analysis and questions of interpretation in early modern tragedy

Louisa Connors
Louisa.Connors@newcastle.edu.au
University of Newcastle, Australia

The use of computational methods of stylistic analysis to consider issues of categorization and authorship is now a widely accepted practice. The use of computational techniques to analyze style has been less well accepted by traditional humanists. The assumption that most humanists make about stylistics in general, and about computational stylistics in particular, is that it is “concerned with the formal and linguistic properties of the text as an isolated item in the work” (Clark 2005).
There are, however, two other points of emphasis that are brought to bear on a text through a cognitive approach. These are: “that which refers to the points of contact between a text, other texts and their readers/listeners”, and “that which positions the text and the consideration of its formal and psychological elements within a socio-cultural and historical context” (Clark 2005). Urszula Clark (2005) argues that these apparently independent strands of analysis or interpretive practice are an “integrated, indissolvable package” (Clark 2005). Computational stylistics of the kind undertaken in this study attempts to link statistical findings with this integrated indissolvable package; it highlights general trends and features that can be used for comparative purposes and also provides us with evidence of the peculiarities and creative adaptations of an individual user. In this case, the individual user is Elizabeth Cary, the author of the earliest extant original play in English by a woman, The Tragedy of Mariam,The Fair Queen of Jewry (1613). As well as Mariam, the set of texts in the sample includes the other 11 closet tragedies associated with the “Sidney Circle”, and 48 tragedies written for the public stage. All plays in the study were written between 1580 and 1640.1 The only other female authored text in the group, Mary Sidney’s Antonius, is a translation, as is Thomas Kyd’s Cornelia2.Alexander Witherspoon (1924), describes the Sidnean closet tragedies as “strikingly alike, and strikingly unlike any other dramas in English” (179). He attributes this to the extent to which the closet writers draw on the work of French playwright Robert Garnier as a model for their own writing. In contrast to other plays of the period closet tragedies have not attracted much in the way of favourable critical attention. They are, as Jonas Barish (1993) suggests, “odd creatures” (19), and Mariam is described as one of oddest. Mariam, as it turns out, is the closet play that is most like a play written for the public stage, in terms of the use of function words. But this isn’t immediately obvious. Some textual preparation was carried out prior to the analysis. Homographs were not tagged, but contracted forms throughout the texts were expanded so that their constituents appeared as separate words. The plays were divided into 2,000 word segments and tagged texts were then run through a frequency count using Intelligent Archive (IA). A total of 563 two-thousand word segments were analysed, 104 of which were from closet plays, and 459 from plays written for the public stage. A discriminant analysis on the basis of the frequency scores of function words demonstrates that there are significant differences between the two groups of plays.Table 1 shows the classification results for a discriminant analysis using the full set of function words. In this test, 561 of the 563 segments were classified correctly. One segment from each group was misclassified. Thus 99.6% of cross-validated grouped cases were correctly classified on the basis of function words alone. The test also showed that only 38 of the 241 function word variables were needed to successfully discriminate between the groups. 
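For readers who want to see what such a test looks like in practice, the following is an illustrative re-implementation in Python with scikit-learn rather than the study's own workflow; the data are synthetic stand-ins for the per-segment function-word frequencies produced by Intelligent Archive:

```python
# Illustrative sketch only: leave-one-out cross-validated discriminant
# classification of segments, analogous to the "cross-validated grouped cases"
# reported in Table 1. The frequencies below are randomly generated stand-ins.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_predict

rng = np.random.default_rng(0)
# Rows: segments; columns: relative frequencies of a handful of function words.
closet = rng.normal(loc=[0.060, 0.020, 0.010], scale=0.004, size=(30, 3))
public = rng.normal(loc=[0.055, 0.025, 0.012], scale=0.004, size=(60, 3))
X = np.vstack([closet, public])
y = np.array(["closet"] * 30 + ["public"] * 60)

pred = cross_val_predict(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
print("cross-validated accuracy:", (pred == y).mean())
```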
Table 1. Classification results for discriminant analysis using the full set of function words in 60 tragedies (1580-1640): closet/non-closet value correctly assigned

                                        Predicted Group Membership
  Closet/Stage                            Closet    Public     Total
  Original          Count     Closet         103         1       104
                              Public           1       458       459
                    %         Closet        99.0       1.0     100.0
                              Public          .2      99.8     100.0
  Cross-validated   Count     Closet         103         1       104
  (a)                         Public           1       458       459
                    %         Closet        99.0       1.0     100.0
                              Public          .2      99.8     100.0

  a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
  b. 99.6% of original grouped cases correctly classified.
  c. 99.6% of cross-validated grouped cases correctly classified.

Figure 1. Discriminant scores for correctly identified public and closet play segments from 60 tragedies (1580-1640) in 2000 word segments on the basis of 38 most discriminating function words

A principal component analysis (PCA) gives additional information about the differences and similarities between the two sets. An Independent-samples T-test was used to identify the variables most responsible for the differences. This process picked out 54 variables that were significant at the level of 0.0001 and these 54 variables were used for the PCA. PCA looks to define factors that can be described as most responsible for differences between groups. For this test, the texts were broken into 4000 word segments to ensure that frequencies remained high enough for reliable analysis. This produced 268 segments in total (50 closet segments and 218 public play segments). Figure 2 plots the 54 most discriminating word-types for the first two eigenvectors (based on factor loadings for each variable) and shows which words behave most like or unlike each other in the sample. In Figure 3 the eigenvalues from the component matrix have been multiplied through the standardised frequencies for each of the 4,000 word segments to show which segments behave most like or unlike each other on the first two principal components. The scores that produce Figure 3 are “the sum of the variable counts for each text segment, after each count is multiplied by the appropriate coefficient” (Burrows and Craig 1994 68). High counts on word variables at the western end of Figure 2 and low counts on word variables at the eastern end bring the text segments at the western end of Figure 3 to their positions on the graph. The reverse is true for text segments on the eastern side of Figure 3. We can see that the far western section of Figure 3 is populated predominantly with public play segments, and that the eastern side of the y-axis is populated exclusively with segments from stage plays. It is clear that in the case of these most discriminating variables, the segments from Mariam are the closet segments most intermingled with the segments written for the public stage.

Figure 2. Principal component analysis for 60 tragedies (1580-1640) in 4000 word segments for 54 most discriminating function words selected from 100 most frequently occurring function words - word plot for first two eigenvectors

Figure 3. Principal component analysis for 60 tragedies (1580-1640) in 4000 word segments for 54 most discriminating function words selected from 100 most frequently occurring function words

Looking at Figure 2, there is evidence of an “emphasis on direct personal exchange” in the western section of the graph. In the opposite section of the graph there is evidence of a more disquisitory style of language that is “less personal”, with “markers of plurality, past time, and a more connected syntax” (Burrows and Craig 1994 70). It may be that the results reflect the kinds of observations that critics have long made about early modern closet tragedy and tragedy written for the public stage, suggesting that “word-counts serve as crude but explicit markers of the subtle stylistic patterns to which we respond when we read well” (Burrows and Craig 1994 70). It may also be the case, however, that we can link these statistical results with more interpretive work. Returning to Mariam, the function words which most distinguish the Mariam segments from the rest of the segments of both closet and public plays are the auxiliary verbs did (8.4) and had (6.9). In the case of did over 500 of the segments have scores of between -1 and 2. Six of the eight Mariam segments are extremely high (the two middle segments of Mariam have fairly average z-scores for did). The lowest score occurs in segment 4, when Mariam is conspicuously absent from the action. A very similar pattern emerges for had. In conventional grammars do and other auxiliaries including be and have are viewed as meaningless morphemes that serve a grammatical purpose. Langacker argues that serving a specifiable grammatical function is not inherently incompatible with being a meaningful element (1987 30). Cognitive linguistics suggests that function word schemas interact with each other to produce what Talmy calls a “dotting” of semantic space (1983 226), and that they “play a basic conceptual structuring role” (Talmy 1988 51). In this framework, auxiliaries are viewed as profiling a process: they determine which entity is profiled by a clause and impose a particular construal. Langacker argues, for example, that do always conveys some notion of activity or some kind of volitionality or control on the part of the subject. Have designates a subjectively construed relation of anteriority and current relevance to a temporal reference point (Langacker 1991 239). Talmy argues that have can be understood in terms of force dynamics patterns; it “expresses indirect causation either without an intermediate volitional entity…or… with such an entity” (1988 65). Talmy goes further to suggest that the “concepts” of force dynamics are “extended by languages to their semantic treatment of psychological elements and interactions” (1988 69).
This approach appears to provide a disciplined way of identifying and analyzing the linguistic features that are foregrounded in a text, while supporting their interpretation as part of an integrated, indissolvable package. Talmy, L. (1983). How language structures space. In H. Pick & L. Acredolo (Eds.), Spatial orientation:Theory, research, and application. New York: Plenum Press. Talmy, L. (1988). Force Dynamics in Language and Cognition. Cognitive Science, 12. 49-100. Witherspoon, A. M. (1924: rpt. 1968). The Influence of Robert Garnier on Elizabethan Drama. Notes 1 Thomas Kyd wrote for both the public and the private stage. Kyd’s Cornelia and The Spanish Tragedy are included in the study. 2 John Burrows (2002) explores some of the issues around translation and whether it can be “assumed that poets stamp their stylistic signatures as firmly on translation as their original work” (679). Burrows found that Drydan was able to “conceal his hand”, but in other cases it appeared that a “stylisitic signature”, even in the case of a translation, remained detectable (696). References Barish, J. (1993). Language for the Study: Language for the Stage. In A. L. Magnusson & C. E. McGee (Eds.), The ElizabethanTheatre XII (pp. 19-43). Toronto: P.D. Meany. Burrows, J. (2002). The Englishing of Jevenal: Computational stylistics and translated Texts. Style, 36, 677-750. Burrows, J. F., & Craig, D. H. (1994). Lyrical Drama and the “Turbid Mountebanks”: Styles of Dialogue in Romantic and Renaissance Tragedy. Computers and the Humanities, 28, 63-86. Cooper, M. M. (1998). Implicature and The Taming of the Shrew. In J. Culpeper, M. H. Short & P.Verdonk (Eds.), Exploring the Language of Drama: From Text to Context (pp. 54-66). London: Routledge. Clark, U. (2005). Social cognition and the future of stylistics, or “What is cognitive stylistics and why are people saying such good things about it?!” Paper presented at the PALA 25: Stylistics and Social Cognition. Connors, L. (2006). An Unregulated Woman: A computational stylistic analysis of Elizabeth Cary’s The Tragedy of Mariam, The Faire Queene of Jewry. Literary and Linguistic Computing, 21(Supplementary Issue), 55-66. Sweetser, E. (1990). From Etymology to Pragmatics: Metaphorical and cultural aspects of semantic change. Cambridge: Cambridge University Press. _____________________________________________________________________________ 88 Digital Humanities 2008 _____________________________________________________________________________ Deconstructing Machine Learning: A Challenge for Digital Humanities Charles Cooney cmcooney@diderot.uchicago.edu University of Chicago, USA Russell Horton russ@diderot.uchicago.edu University of Chicago, USA Mark Olsen mark@barkov.uchicago.edu University of Chicago, USA Glenn Roe glenn@diderot.uchicago.edu University of Chicago, USA Robert Voyer rlvoyer@diderot.uchicago.edu University of Chicago, USA Machine learning tools, including document classification and clustering techniques, are particularly promising for digital humanities because they offer the potential of using machines to discover meaningful patterns in large repositories of text. Given the rapidly increasing size and availability of digital libraries, it is clear that machine learning systems will, of necessity, become widely deployed in a variety of specific tasks that aim to make these vast collections intelligible. 
While effective and powerful, machine learning algorithms and techniques are not a panacea to be adopted and employed uncritically for humanistic research. As we build and begin to use them, we in the digital humanities must focus not only on the performance of specific algorithms and applications, but also on the theoretical and methodological underpinnings of these systems. At the heart of most machine learning tools is some variety of classifier. Classification, however, is a fraught endeavor in our poststructuralist world. Grouping things or ideas into categories is crucial and important, but doing so without understanding and being aware of one’s own assumptions is a dubious activity, not necessarily for moral, but for intellectual reasons. After all, hierarchies and orders of knowledge have been shown to be both historically contingent and reflections of prevailing power structures. Bowker and Star state that “classification systems in general... reflect the conflicting, contradictory motives of the sociotechnical situations that gave rise to them” (64). As machine learning techniques become more widely applied to all forms of electronic text, from the WWW to the emerging global digital library, an awareness of the politics of classification and the ordering of knowledge will become ever more important. We would therefore like to present a paper outlining our concerns about these techniques and their underlying technical/intellectual assumptions based on our experience using them for experimental research. In many ways, machine learning relies on approaches that seem antithetical to humanistic text analysis and reading and to more general poststructuralist sensibilities.The most powerful and effective techniques rely on the abilities of systems to classify documents and parts of documents, often in binary oppositions (spam/not spam, male/female, etc). Features of documents employed in machine learning applications tend to be restricted to small subsets of available words, expressions or other textual attributes. Clustering of documents based on relatively small feature sets into a small and often arbitrary number of groups similarly tends to focus on broad patterns. Lost in all of these operations are the marginal and exceptional, rendered hidden and invisible as it were, in classification schemes and feature selection. Feature set selection is the first necessary step in many text mining tasks. Ian Witten notes that in “many practical situations there are far too many attributes for learning schemes to handle, and some of them -- perhaps the overwhelming majority - are clearly irrelevant or redundant” (286-7). In our work, we routinely reduce the number of features (words, lemmas, bigrams, etc) using a variety of techniques, most frequently by filtering out features which occur in a small subset of documents or instances.This selection process is further required to avoid “overfitting” a learner to the training data. One could build an effective classifier and train it using features that are unique to particular documents, but doing so would limit the general applicability of the tool.Attempting to classify French novels by gender of author while retaining the names of characters (as in Sand’s novel, Conseulo) or other distinctive elements is very effective, but says little about gendered writing in 19th century France (Argamon et. al., Discourse). Indeed, many classification tasks may be successful using a tiny subset of all of the words in a corpus. 
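As a hedged illustration of the document-frequency filtering described above (the miniature corpus, the threshold and the choice of vectorizer are placeholders, not the authors' actual PhiloMine setup):

```python
# Sketch of feature-set reduction by document frequency: words occurring in too
# few documents are dropped before classification, to limit overfitting.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the quarrel of the elders over the goats",
    "momma said the dude was downtown in the hallway",
    "the priest spoke of custom and corruption",
    "nothin but whiskey and jive down in georgia",
]

# Keep only features appearing in at least half of the documents; rare,
# document-specific words are filtered out of the feature space.
vectorizer = CountVectorizer(min_df=0.5)
X = vectorizer.fit_transform(docs)
print(sorted(vectorizer.get_feature_names_out()))  # the surviving feature set
```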
In examining American and non-American Black Drama, we achieved over 90% accuracy in classifying over nearly 700 plays using a feature set of only 60 surface words (Argamon et. al., Gender, Race). Using a vector space similarity function to detect articles in the Encyclopédie which borrow significantly from the Dictionnaire de Trévoux, we routinely get impressive performance by selecting fewer than 1,000 of the 400,000 unique forms in the two documents (Allen et. al.). The requirement of greatly reductive feature set selection for practical text mining and the ability of the systems to perform effective classifications based on even smaller subsets suggests that there is a significant distance from the texts at which machine learning must operate in order to be effective. Given the reductive nature of the features used in text mining tasks, even the most successful classification task tends to highlight the lowest common denominators, which at best may be of little textual interest and at worst extremely misleading, encouraging stereotypical conclusions. Using a decision tree to classify modern and ancient geography articles in the Encyclopédie, we found “selon” (according to) to be the primary distinction, reflecting citation of ancient _____________________________________________________________________________ 89 Digital Humanities 2008 _____________________________________________________________________________ sources (“selon Pline”). Classification of Black Drama by gender of author and gender of speaker can be very effective (80% or more accuracy), but the features identified by the classifiers may privilege particular stereotypes. The unhappy relationship of Black American men with the criminal justice system or the importance of family matters to women are both certainly themes raised in these plays. Of course, men talk more of wives than women and only women tend to call other women “hussies,” so it is hardly surprising that male and female authors/characters speak of different things in somewhat different ways. However, the operation of classifiers is predicated on detecting patterns of word usage which most distinguish groups and may bring to the forefront literary and linguistic elements which play a relatively minor role in the texts themselves. We have found similar results in other classification tasks, including gender mining in French literary works and Encyclopédie classifications. Machine learning systems are best, in terms of various measures of accuracy, at binomial classification tasks, the dreaded “binary oppositions” of male/female, black/white and so forth, which have been the focus of much critical discussion in the humanities. Given the ability of statistical learners to find very thin slices of difference, it may be that any operation of any binary opposition may be tested and confirmed. If we ask for gender classification, the systems will do just that, return gender classifications. This suggests that certain types of hypothesis testing, particularly in regard to binary classifications, may show a successful result simply based on the framing of the question. It is furthermore unclear as to just what a successful classification means. If we identify gender or race of authors or characters, for example, at a better than 80% rate and generate a list of features most associated with both sides of the opposition, what does this tell us about the failed 20%? 
Are these errors to be corrected, presumably by improving classifiers or clustering models, or should we further investigate them as interesting marginal instances? What may be considered a failure in computer science could be an interesting anomaly in the humanities. Machine learning offers great promise to humanistic textual scholarship and the development of digital libraries. Using systems to sift through the ever-increasing amounts of electronic texts to detect meaningful patterns offers the ability to frame new kinds of questions. But these technologies bring with them a set of assumptions and operations that should be subject to careful critical scrutiny. We in the digital humanities must do this critical work, relying on our understanding of epistemology and our technical skills to open the black box and shine light on what is inside. Deconstruction in the digital library should be a reading strategy not only for the texts found therein, but also for the systems being developed to manage, control and make the contents of electronic resources accessible and intelligible.

Bibliography
Allen, Timothy, Stéphane Douard, Charles Cooney, Russell Horton, Robert Morrissey, Mark Olsen, Glenn Roe, and Robert Voyer. “Plundering Philosophers: Identifying Sources of the Encyclopédie using the Vector Space Model.” In preparation for Text Technology.
Argamon, Shlomo, Russell Horton, Mark Olsen, and Sterling Stuart Stein. “Gender, Race, and Nationality in Black Drama, 1850-2000: Mining Differences in Language Use in Authors and their Characters.” DH07, Urbana-Champaign, Illinois, June 4-8, 2007.
Argamon, Shlomo, Jean-Baptiste Goulain, Russell Horton, and Mark Olsen. “Discourse, Power, and Ecriture Féminine: Text Mining Gender Difference in 18th and 19th Century French Literature.” DH07, Urbana-Champaign, Illinois, June 4-8, 2007.
Bowker, Geoffrey C. and Susan Leigh Star. Sorting Things Out: Classification and its Consequences. Cambridge, MA: MIT Press, 1999.
Introna, L. and H. Nissenbaum. “Shaping the Web: Why the Politics of Search Engines Matters.” The Information Society, 16(3): 1-17, 2000.
Witten, Ian H. and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco, CA: Morgan Kaufmann, 2005.

Feature Creep: Evaluating Feature Sets for Text Mining Literary Corpora
Charles Cooney cmcooney@diderot.uchicago.edu University of Chicago, USA
Russell Horton russ@diderot.uchicago.edu University of Chicago, USA
Mark Olsen mark@barkov.uchicago.edu University of Chicago, USA
Glenn Roe glenn@diderot.uchicago.edu University of Chicago, USA
Robert Voyer rlvoyer@diderot.uchicago.edu University of Chicago, USA

Machine learning offers the tantalizing possibility of discovering meaningful patterns across large corpora of literary texts. While classifiers can generate potentially provocative hints which may lead to new critical interpretations, the feature sets identified by these approaches tend not to be able to represent the complexity of an entire group of texts and often are too reductive to be intellectually satisfying. This paper describes our current work exploring ways to balance the performance of machine learning systems with critical interpretation.
Careful consideration of feature set design and selection may provide literary critics with the cues that provoke readings of divergent classes of texts. In particular, we are looking at lexical groupings of feature sets that hold the promise of revealing the predominant “idea chunklets” of a corpus. Text data mining and machine learning applications are dependent on the design and selection of feature sets. Feature sets in text mining may be described as structured data extracted or otherwise computed from running or unstructured text in documents, which serves as the raw data for a particular classification task. Such features typically include words, lemmas, n-grams, parts of speech, phrases, named entities, or other actionable information computed from the contents of documents and/or associated metadata. Feature set selection is generally required in text mining in order to reduce the dimensionality of the “native feature space”, which can range into the tens or hundreds of thousands of words in even modest-sized corpora [Yang]. Not only do many widely used machine learning approaches become inefficient when using such high-dimensional feature spaces, but such extensive feature sets may actually reduce the performance of a classifier [Witten 2005: 286-7]. Li and Sun report that “many irrelevant terms have a detrimental effect on categorization accuracy due to overfitting”, and that tasks which “have many relevant but redundant features... also hurt categorization accuracy” [Li 2007]. Feature set design and selection in computer or information science is evaluated primarily in terms of classifier performance or measures of recall and precision in search tasks. But the features identified as most salient to a classification task may be of more interest than performance rates to textual scholars in the humanities as well as other disciplines. In a recent study of American Supreme Court documents, McIntosh [2007] refers to this approach as “Comparative Categorical Feature Analysis”. In our previous work, we have used a variety of classifiers and functions implemented in PhiloMine [1] to examine the most distinctive words in comparisons of a wide range of classes, such as author and character genders, time periods, and author race and ethnicity, in a number of different text corpora [Argamon et al. 2007a and 2007b]. This work suggests that using different kinds of features -- surface form words compared to bilemmas and bigrams -- allows for similar classifier performance on a selected task, while identifying intellectually distinct types of differences between the selected groups of documents. Argamon, Horton et al. [Argamon 2007a] demonstrated that learners run on a corpus of plays by Black authors successfully classified texts by nationality of author, either American or non-American, at rates ranging between 85% and 92%. The features tend to describe gross literary distinctions between the two discourses reasonably well. Examination of the top 30 features is instructive.

American: ya’, momma, gon’, jones, sho, mississippi, dude, hallway, nothin, georgia, yo’, naw, alabama, git, outta, y’, downtown, colored, lawd, mon, punk, whiskey, county, tryin’, runnin’, jive, buddy, gal, gonna, funky

Non-American: na, learnt, don, goat, rubbish, eh, chief, elders, compound, custom, rude, blasted, quarrel, chop, wives, professor, goats, pat, corruption, cattle, hmm, priest, hunger, palace, forbid, warriors, princess, gods, abroad, politicians

Compared side by side, these lists of terms have a direct intuitive appeal.
The American terms suggest a body of plays that deal with the Deep South (the state names), perhaps the migration of African-Americans to northern cities (hallway and downtown), and also contain idiomatic and slang speech (ya’, gon’, git, jive) and the language of racial distinction (colored). The non-American terms reveal, as one might expect, a completely different universe of traditional societies (chief, elders, custom) and life under colonial rule (professor, corruption, politicians). Yet a drawback to these features is that they have a stereotypical feel. Moreover, these lists of single terms reduce the many linguistically complex and varied works in a corpus to a distilled series of terms. While a group of words, in the form of a concordance, can show something quite concrete about a particular author’s oeuvre or an individual play, it is difficult to come to a nuanced understanding of an entire corpus through such a list, no matter how long. Intellectually, lists of single terms do not scale up to provide an adequate abstract picture of the concerns and ideas represented in a body of works. Performing the same classification task using bilemmas (bigrams of word lemmas with function words removed) reveals both slightly better performance than surface words (89.6% cross-validated) and a rather more specific set of highly ranked features. Running one’s eye down this list is revealing:

American: yo_mama, white_folk, black_folk, ole_lady, st_louis, uncle_tom, rise_cross, color_folk, front_porch, jim_crow, sing_blue, black_male, new_orleans, black_boy, cross_door, black_community, james_brown

Non-American: palm_wine, market_place, dip_hand, cannot_afford, high_priest, piece_land, join_hand, bring_water, cock_crow, voice_people, hope_nothing, pour_libation, own_country, people_land, return_home

American (not present in non-American): color_boy, color_girl, jive_ass, folk_live

Here, we see many specific instances of African-American experience, community, and locations. Using bigrams instead of bilemmas delivers almost exactly the same classifier performance. However, not all works classified correctly using bilemmas are classified correctly using bigrams. Langston Hughes’s The Black Nativity, for example, is correctly identified as American when using bigrams but incorrectly classified when using bilemmas. The most salient bigrams in the classification task are comparable to, but not the same as, the bilemmas.
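A minimal sketch of the distinction between surface bigrams and bilemmas of the kind listed above follows; the tiny stop list and lemma table are invented for illustration, and the project’s own lemmatization and filtering are far more extensive.

    # Minimal sketch: bigrams over surface words vs. bilemmas (bigrams of lemmas
    # with function words removed). The lemma table and stop list are invented.
    STOP   = {"the", "of", "for", "a"}
    LEMMAS = {"rights": "right", "folks": "folk", "wives": "wife"}

    def bigrams(tokens):
        return list(zip(tokens, tokens[1:]))

    def bilemmas(tokens):
        content = [LEMMAS.get(t, t) for t in tokens if t not in STOP]
        return list(zip(content, content[1:]))

    tokens = "the civil rights of the black folks".split()
    print(bigrams(tokens))    # surface bigrams, function words included
    print(bilemmas(tokens))   # [('civil', 'right'), ('right', 'black'), ('black', 'folk')]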
The lemmas of “civil rights” and “human rights” do not appear in the top 200 bilemmas for either American or non-American features, but they do appear among the bigrams, with “civil rights” as the 124th most predictive American feature and “human rights” as the 111th among non-American features. As the example of The Black Nativity illustrates, we have found that different feature sets give different results because, of course, using different feature sets means fundamentally changing the lexically based standards the classifier relies on to make its decision. Our tests have shown us that, for the scholar interested in examining feature sets, there is therefore no single, definitive feature set that provides a “best view” of the texts or the ideas in them. We will continue exploring feature set selection on a range of corpora representing different genres and eras, including Black Drama, French Women Writers, and a collection of American poetry. Keeping in mind the need to balance performance and intelligibility, we would like to see which combinations of features work best on poetry, for example, compared to dramatic writing. Text classifiers will always be judged primarily on how well they group similar text objects. Nevertheless, we think they can also be useful as discovery tools, allowing critics to find sets of ideas that are common to particular classes of texts.

Notes
1. See http://philologic.uchicago.edu/philomine/. We have largely completed work on PhiloMine2, which allows the user to perform classification tasks using a variety of features and filtering on entire documents or parts of documents. The features include words, lemmas, bigrams, bilemmas, trigrams and trilemmas, which can be used in various combinations.

References
[Argamon 2007a] Argamon, Shlomo, Russell Horton, Mark Olsen, and Sterling Stuart Stein. “Gender, Race, and Nationality in Black Drama, 1850-2000: Mining Differences in Language Use in Authors and their Characters.” DH07, Urbana-Champaign, Illinois, June 4-8, 2007.
[Argamon 2007b] Argamon, Shlomo, Jean-Baptiste Goulain, Russell Horton, and Mark Olsen. “Discourse, Power, and Ecriture Féminine: Text Mining Gender Difference in 18th and 19th Century French Literature.” DH07, Urbana-Champaign, Illinois, June 4-8, 2007.
[Li 2007] Li, Jingyang and Maosong Sun. “Scalable Term Selection for Text Categorization.” Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 774-782.
[McIntosh 2007] McIntosh, Wayne. “The Digital Docket: Categorical Feature Analysis and Legal Meme Tracking in the Supreme Court Corpus.” Chicago Colloquium on Digital Humanities and Computer Science, Northwestern University, October 21-22, 2007.
[Witten 2005] Witten, Ian H. and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco, CA: Morgan Kaufmann, 2005.
[Yang] Yang, Yiming and Jan Pedersen. “A Comparative Study on Feature Selection in Text Categorization.” In D. H. Fisher (ed.), Proceedings of ICML-97, 14th International Conference on Machine Learning, pp. 412-420, Nashville, US. Morgan Kaufmann Publishers, San Francisco, US.

Hidden Roads and Twisted Paths: Intertextual Discovery using Clusters, Classifications, and Similarities
Charles Cooney cmcooney@diderot.uchicago.edu University of Chicago, USA
Russell Horton russ@diderot.uchicago.edu University of Chicago, USA
Mark Olsen mark@barkov.uchicago.edu University of Chicago, USA
Glenn Roe glenn@diderot.uchicago.edu University of Chicago, USA
Robert Voyer rlvoyer@diderot.uchicago.edu University of Chicago, USA

While information retrieval (IR) and text analysis in the humanities may share many common algorithms and technologies, they diverge markedly in their primary objects of analysis, use of results, and objectives. IR is designed to find documents containing textual information bearing on a specified subject, frequently by constructing abstract representations of documents or “distancing” the reader from texts. Humanistic text analysis, on the other hand, is aimed primarily at enhancing understanding of textual information as a complement to the “close reading” of relatively short passages.
Interpreting a poem, a novel or a philosophical treatise in the context of large digital corpora is, we would argue, a constant interplay between the estranging technologies of machine learning and the direct reading of passages in a work. A significant element of interpreting or understanding a passage in a primary text is based on linking its specific elements to parts of other works, often within a particular temporal framework. To paraphrase Harold Bloom, ‘understanding’ in humanistic textual scholarship ‘is the art of knowing the hidden roads that go from text to text’.[1] Indeed, situating a passage of a text in the wider context of an author’s oeuvre, time period, or even larger intellectual tradition is one of the hallmarks of textual hermeneutics. While finding the “hidden roads and twisted paths” between texts has always been subject to the limitations of human reading and recollection, machine learning and text mining offer the tantalizing prospect of making that search easier. Computers can sift through ever-growing collections of primary documents to help readers find meaningful patterns, guiding research and mitigating the frailties of human memory. We believe that a combination of supervised and unsupervised machine learning approaches can be integrated to propose various kinds of passages of potential interest based on the passage a reader is examining at a given moment, overcoming some limitations of traditional IR tools. One-dimensional measures of similarity, such as the single numerical score generated by a vector space model, fail to account for the diverse ways texts interact. Traditional ‘ranked relevancy retrieval’ models assume a single measure of relevance that can be expressed as an ordered list of documents, whereas real textual objects are composed of smaller divisions that are each relevant to other text objects in various complex ways. Our goal is to utilize the many machine learning tools available to create more sophisticated models of intertextual relation than the monolithic notion of “similarity.” We plan to adapt various ideas from information retrieval and machine learning for our use. Measures of document similarity form the basis of modern information retrieval systems, which use a variety of techniques to compare input search strings to document instances. In 1995, Singhal and Salton[2] proposed that vector space similarity measures may also help to identify related documents without human intervention, such as human-embedded hypertext links. Work by James Allan[3] suggests the possibility of categorizing automatically generated links using innate characteristics of the text objects, for example by asserting the asymmetric relation “summary and expansion” based on the relative sizes of objects judged to be similar. Because we will operate not with a single similarity metric but with the results of multiple classifiers and clusterers, we can expand on this technique, categorizing intertextual links by feeding all of our data into a final voting mechanism such as a decision tree. Our experiments with intertextual discovery began with using vector space calculations to try to identify that most direct of intertextual relationships, plagiarism or borrowing.
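As a rough illustration of the kind of thresholded vector space comparison involved, the following minimal sketch scores a handful of invented article texts against one another and keeps pairs above a cut-off; the texts, threshold and use of scikit-learn are assumptions made for the example and are not the PhiloMine implementation described below.

    # Minimal sketch of a thresholded vector-space (TF-IDF cosine) comparison.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    encyclopedie = ["ville ancienne de Grece selon Pline", "riviere de France"]
    trevoux      = ["ville ancienne de Grece selon Pline & Strabon", "terme de musique"]

    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(encyclopedie + trevoux)
    scores = cosine_similarity(matrix[:len(encyclopedie)], matrix[len(encyclopedie):])

    THRESHOLD = 0.6   # hypothetical user-designated cut-off
    for i, row in enumerate(scores):
        for j, score in enumerate(row):
            if score >= THRESHOLD:
                print(f"Encyclopédie article {i} resembles Trévoux article {j} ({score:.2f})")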
Using the interactive vector space function in PhiloMine[4], we compared the 77,000 articles of the 18th-century Encyclopédie of Diderot and d’Alembert to the 77,000 entries in a reference work contemporary to it, the Dictionnaire universel françois et latin, published by the Jesuits in the small town of Trévoux. The Jesuit defenders of the Dictionnaire de Trévoux, as it was popularly known, loudly accused the Encyclopédists of extensive plagiarism, a charge Diderot vigorously refuted, but which has never been systematically investigated by scholars. Our procedure is to compare all articles beginning with a specific letter in the Encyclopédie to all Trévoux articles beginning with the same letter. For each article in the Encyclopédie, the system displays each Trévoux article that scores above a user-designated similarity threshold. Human readers manually inspect possible matches, noting those that were probably plagiarized. Having completed 18 of 26 letters, we have found that more than 2,000 articles (over 5% of those consulted) in the Encyclopédie were “borrowed” from the Dictionnaire de Trévoux, a surprisingly large proportion given the well-known antagonism between the Jesuits and the Encyclopédists.[5] The Encyclopédie experiment has shown us strengths and weaknesses of the vector space model on one specific kind of textual relationship, borrowing, and has spurred us to devise an expanded approach to find additional types of intertextual links. Vector space proves to be very effective at finding textual similarities indicative of borrowing, even in cases where significant differences occur between passages. However, vector space matching is not as effective at finding articles borrowed from the Trévoux that became parts of larger Encyclopédie articles, suggesting that we might profit from shrinking the size of our comparison objects, with paragraph-level objects being one obvious choice. In manually sifting through proposed borrowings, we also noticed articles that weren’t linked by direct borrowing, but in other ways, such as shared topic, tangential topic, expansion on an idea, a differing take on the same theme, etc. We believe that some of these qualities may be captured by other machine learning techniques. Experiments with document clustering using packages such as CLUTO have shown promise in identifying text objects of similar topic, and we have had success using naive Bayesian classifiers to label texts by topic, authorial style and time period. Different feature sets also offer different insights, with part-of-speech tagging reducing features to a bare, structural minimum and n-gram features providing a more semantic perspective. Using clustering and classifiers operating on a variety of feature sets should improve the quality of proposed intertextual links as well as providing a way to assign different types of relationships, rather than simply labeling two text objects as broadly similar. To test our hypothesis, we will conduct experiments linking Encyclopédie articles to running text in other contemporaneous French literature and reference materials using the various techniques we have described, with an emphasis on intelligently synthesizing the results of various machine learning techniques to validate and characterize proposed linkages.
We will create vector representations of surface form, lemma, and n-gram feature sets for the Encyclopédie and the object texts as a pre-processing step, before subjecting the data to clustering and categorization of several varieties. Models trained on the Encyclopédie will be used to classify and cluster running text, so that for each segment of text we will have a number of classifications and scores that show how related it is to various Encyclopédie articles and classes of articles. A decision tree will be trained to take into account all of the classifications and relatedness measures we have available, along with innate characteristics of each text object such as length, and to determine whether a link should exist between two given text objects, and if so what kind of link. We believe a decision tree model is a good choice because such models excel at generating transparent classification procedures from low-dimensionality data. The toolbox that we have inherited or appropriated from information retrieval needs to be extended to address humanistic issues of intertextuality that are irreducible to single numerical scores or ranked lists of documents. Humanists know that texts, and parts of texts, participate in complex relationships of various kinds, far more nuanced than the reductionist concept of “similarity” that IR has generally adopted. Fortunately, we have a wide variety of machine learning tools at our disposal which can quantify different kinds of relatedness. By taking a broad view of all these measures, while looking narrowly at smaller segments of texts such as paragraphs, we endeavor to design a system that can propose specific kinds of lower-level intertextual relationships that more accurately reflect the richness and complexity of humanities texts. This kind of tool is necessary to aid the scholar in bridging the gap between the distant view required to manipulate our massive modern text repositories and the traditional close, contextual reading that forms the backbone of humanistic textual study.
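To make the proposed link-typing step more concrete, the following minimal sketch trains a decision tree on a handful of invented relatedness measures; the features, values and link labels are hypothetical and are not drawn from the project’s data.

    # Minimal sketch of link typing with a decision tree over invented relatedness
    # measures: [vector-space similarity, same topic cluster (0/1), length ratio].
    from sklearn.tree import DecisionTreeClassifier

    X = [
        [0.92, 1, 1.1],   # near-duplicate text, same cluster
        [0.85, 1, 0.2],   # strong overlap, target much longer
        [0.35, 1, 1.0],   # weak overlap but same topic
        [0.10, 0, 0.9],   # unrelated
    ]
    y = ["borrowing", "borrowed-into-larger-article", "shared-topic", "no-link"]

    model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    print(model.predict([[0.95, 1, 1.2]]))   # ['borrowing']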
Notes
1. “Criticism is the art of knowing the hidden roads that go from poem to poem”, Harold Bloom, “Interchapter: A Manifesto for Antithetical Criticism”, in The Anxiety of Influence: A Theory of Poetry (Oxford University Press, New York, 1973).
2. Singhal, A. and Salton, G. “Automatic Text Browsing Using Vector Space Model” in Proceedings of the Dual-Use Technologies and Applications Conference, May 1995, 318-324.
3. Allan, James. “Automatic Hypertext Linking” in Proc. 7th ACM Conference on Hypertext, Washington DC, 1996, 42-52.
4. PhiloMine is the text mining extension to PhiloLogic which the ARTFL Project released in Spring 2007. Documentation, source code, and many examples are available at http://philologic.uchicago.edu/philomine/. Current work includes bigrams, better normalization, and related improvements.
5. An article describing this work is in preparation for submission to Text Technology.

A novel way for the comparative analysis of adaptations based on vocabulary rich text segments: the assessment of Dan Brown’s The Da Vinci Code and its translations
Maria Csernoch mariacsernoch@hotmail.com University of Debrecen, Hungary

In this work the appearance of the newly introduced words of The Da Vinci Code and its different translations have been analyzed and compared. The concept “newly introduced words” refers to words whose first appearance is detected at a certain point of the text. In general the number of the newly introduced words follows a monotonic decay; however, there are segments of the text where this decay is reversed and a sudden increase is detectable (Csernoch, 2006a, 2006b, 2007) – these text slices are referred to as vocabulary rich text slices. The question arises whether the detectable, unexpectedly high number of newly introduced words of the original work is traceable in the different translations of the text or not. Before advancing on the project let us define the concept of translation. In this context the definition of Hatim and Mason (1993) is accepted, that is, any adaptation of a literary work is considered a translation. Beyond the foreign language translations of The Da Vinci Code, two further adaptations, the lemmatized and the condensed versions of the original work, were analyzed. The comparison of the newly introduced word types and lemmas of the different translations allows us to trace how precisely a translation follows the changes in the vocabulary of the original text: whether the vocabulary rich sections of the original text appear similarly rich in the translations, whether the translations swallow them up, or whether the translations are richer at certain sections than the original work. In addition, could the changes in the newly introduced words, as an additional parameter to those already applied, be used to give a hint of the quality of a translation? To carry out the analysis a previously introduced method was applied (Csernoch, 2006a, 2006b). The text was divided into constant-length intervals, blocks. The number of the newly introduced words was mapped to each block. Those text segments were considered vocabulary rich in which the number of the newly introduced words significantly exceeds that predicted by a first-order statistical model.
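A minimal sketch of this block-based counting of newly introduced word types follows; the block size and token stream are placeholders, and the significance test against the first-order statistical model is omitted.

    # Minimal sketch: count newly introduced word types per fixed-length block.
    def new_types_per_block(tokens, block_size=1000):
        seen, counts = set(), []
        for start in range(0, len(tokens), block_size):
            block = tokens[start:start + block_size]
            new = {t for t in block if t not in seen}
            counts.append(len(new))
            seen.update(new)
        return counts

    tokens = "the da vinci code text goes here ...".split()   # placeholder token stream
    print(new_types_per_block(tokens, block_size=4))          # [4, 4] for this toy input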
The original The Da Vinci Code
In the original The Da Vinci Code eleven significant, vocabulary rich text slices were detected. This number is similar to what was found in other, previously analyzed works (Csernoch, 2007). However, a further analysis of these text slices was carried out to reveal their nature. Four parameters were determined: the distribution of these text slices within the text; their length – the number of blocks in the significant text slice; their intensity – the local maximum of the relative number of the newly introduced words; and their content. The majority of the significant text slices of The Da Vinci Code turned out to be unusually small in both length and intensity, containing descriptions of events, characters and places. None of them stands for stylistic changes, which usually (Baayen, 2001) trigger extremely vocabulary rich text slices. Their distribution is uneven; they mainly appear in the first half of the text. This means that the second half of the novel hardly introduces any more vocabulary items than predicted by a first-order statistical model.

The analysis of the lemmatized texts
To see whether the word types, with all their suffixes, are responsible for any losses of vocabulary rich text slices, lemmatization, as an adaptation of the text, was carried out. As was found earlier (Csernoch, 2006b), the lemmatization of the text produced minor differences, if any, in the analysis of the word types in an English text. The Hungarian counterparts were compared to the English non-lemmatized and lemmatized The Da Vinci Code. Unlike in the English texts, at the beginning of the Hungarian texts the number of newly introduced word types was so high that up to the first two hundred blocks (20,000 tokens) some of the text slices with a significantly high number of newly introduced word types might be swallowed up.

The foreign language translations of The Da Vinci Code
While the absolute numbers of the newly introduced words of texts in different languages cannot be compared, their relative numbers can, using the method outlined in Csernoch (2006a). Relying on the advantages of the method, three different translations were compared to the original text: the Hungarian, the German, and the French. If the vocabulary rich text slices are found at the same positions both in the original and in the translated text, the translation can be considered exact in respect of vocabulary richness. In the Hungarian translation the lengths and the intensities of the vocabulary rich text slices were not altered; however, their distribution was more even than in the English text, and their number increased to sixteen. While the vocabulary rich text slices of the English text were all found in the Hungarian text, further such text slices were identified in the second half of the Hungarian text. The comparison revealed that the Hungarian text is richer in vocabulary than the English text. The German translation replicated the vocabulary rich text slices of the original English text and provided five more, which, similarly to the Hungarian translation, means that these text slices are richer in vocabulary than the corresponding text slices in the English text. In the French translation only nine significant text slices were found. All of these text slices were only one block long, and their intensities were also surprisingly small. This means that the distribution of the newly introduced words in the French translation hardly differs from that predicted by the model. Furthermore, they are quite different in content from the vocabulary rich text slices in the other languages. Thus, there are hardly any concentrated, vocabulary rich text slices in the French text.

The condensed versions of The Da Vinci Code
Finally, the condensed versions of the English and Hungarian texts were compared to the corresponding full-length texts and to each other. Several questions are to be answered in this context. By the nature of this adaptation it is obvious that the length of the text is curtailed to some extent. However, this parameter does not tell much about the nature of the condensation. We do not know from this parameter whether the condensation is only a cropping of the original text – certain text segments are left out while others are untouched – or whether the whole text is modified to some extent. If the text is modified, the percentages of the remaining tokens, word types, lemmas, and hapax legomena are parameters which tell us more about the condensation. To get further insight it is worth considering how the vocabulary rich text segments of the original text are transferred into the condensed text. This last parameter might be a great help in deciding from which first-order adaptation a second-order adaptation of a text – in this case the condensed Hungarian text – is derived. Both the English and the Hungarian condensed texts are 45% of the original texts in length. The number of word types is 64 and 55%, the number of lemmas 64 and 61%, and the number of hapax legomena 70 and 56% of the English and Hungarian full-length texts, respectively. These parameters indicate that the Hungarian condensed text bore more serious damage than the English did. The number of vocabulary rich text segments dropped to six – somewhat more than half of the original number – in the English text. On the other hand, the number of these text segments in the Hungarian text dropped to one-third, which is a notable difference compared to the full-length Hungarian text. In both the condensed English and Hungarian texts the vocabulary rich segments were concentrated in the first half of the texts, representing the same events, and none of the segments unique to the full-length Hungarian text appeared in the condensed Hungarian text. The cumulative length of the vocabulary rich text slices dropped to 51% in the English and to 43% in the Hungarian text. Again, the Hungarian text seemed to be damaged to a greater extent. All the analyzed parameters thus clearly indicate that for the condensed Hungarian version the condensed English text was the direct source.

Bibliography
Baayen, R. H. (2001) Word Frequency Distributions. Kluwer Academic Publishers, Dordrecht, Netherlands.
Csernoch, M. (2006a) The introduction of word types and lemmas in novels, short stories and their translations. http://www.allc-ach2006.colloques.paris-sorbone.fr/DHs.pdf. Digital Humanities 2006. The First International Conference of the Alliance of Digital Humanities Organisations (5-9 July 2006, Paris).
Csernoch, M. (2006b) Frequency-based Dynamic Models for the Analysis of English and Hungarian Literary Works and Coursebooks for English as a Second Language. Teaching Mathematics and Computer Science. Debrecen, Hungary.
Csernoch, M. (2007) Seasonalities in the Introduction of Word-types in Literary Works. Publicationes Universitatis Miskolcinensis, Sectio Philosophica, Tomus XI. – Fasciculus 3. Miskolc 2006-2007.
Hatim, B. and Mason, I. (1993) Discourse and the Translator. Longman Inc., New York.
Converting St Paul: A new TEI P5 edition of The Conversion of St Paul using stand-off linking
James C. Cummings James.Cummings@oucs.ox.ac.uk University of Oxford, UK

In researching the textual phenomena and scribal practices of late-medieval drama I have been creating an electronic edition of The Conversion of St Paul. This late-medieval verse play survives only in Bodleian MS Digby 133.
My edition attempts to be a useful scholarly work which, where possible, leverages existing resources to create an agile, interoperable resource. In undertaking this work I have been able to explore a number of issues related to the creation of such editions which may have pedagogic benefit as a case study for others. These include shortcuts to speed up the creation of the initial edition, the generation of additional resources based solely on a single XML-encoded text, the use of new mechanisms in the TEI P5 Guidelines, and the exploitation of stand-off markup to experiment with seamlessly incorporating external resources. The TEI P5 Guidelines have been followed in the production of this edition, and specifically I have used a number of features which are new additions to TEI P5 and with which others may not yet be familiar. The edition was developed in tandem with early drafts of the Guidelines, in part to test some of the features we were adding. These include:
• A customised view of the TEI expressed as a TEI ODD file. This allows generation not only of constrained TEI schemas and DTDs but also of project-specific documentation through the TEI’s ODD processor ‘Roma’.
• A manuscript description of MS Digby 133, and the manuscript item of this play, using the TEI’s new module for manuscript description metadata.
• Consistent and in-depth use of the new <choice> structure to provide alternative textual information at individual points in the text. Specifically, in this highly abbreviated medieval text, this has been used to provide both abbreviations and expansions, which can then be toggled in the rendered version. Regularisation of medieval spelling has been handled with stand-off markup, but could equally have been incorporated into <choice>.
• Inside abbreviations and expansions, the new elements <am> (abbreviation marker) and <ex> (expanded text) have been used. This allows the marking and subsequent display of abbreviation marks in a diplomatic edition view and italicised rendering of the supplied text in the expanded view.
• The edition also records information about the digital photographic surrogate provided by the Bodleian using the new <facsimile> element. While this will also allow linking from the text to particular zones of the images, this has not yet been undertaken.
• The edition also uses various new URI-based pointing mechanisms and datatype constraints new to TEI P5.
To speed up the creation of the edition I took an out-of-copyright printed version of the text (Furnivall 1896) which was scanned and passed through optical character recognition. This was then carefully corrected and proofread letter-by-letter against freely available (though restrictively licensed) images made available online by the Bodleian Library, Oxford (at http://image.ox.ac.uk/show?collection=bodleian&manuscript=msdigby133). I used OpenOffice to create the initial edition, with specialised formatting used to indicate various textual and editorial phenomena such as expanded material, superscript abbreviation markers, stage directions, and notes. The up-scaling from this presentational markup was achieved when I converted it, using XSLT, to very basic TEI XML. While this is a quick method of data entry familiar to many projects, it only tends to work successfully in non-collaborative projects, where the consistency of the application of formatting can be more easily controlled, as any inconsistencies can lead to significant manual correction of the generated XML.
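As an illustration of the toggling between abbreviated and expanded readings that the <choice>, <am> and <ex> encoding described above makes possible, here is a minimal sketch; the fragment and the Python rendering are invented for the example and are not the edition’s own encoding or toolchain.

    # Minimal sketch: deriving a diplomatic and a reading view from a hypothetical
    # TEI <choice> fragment containing <abbr>/<am> and <expan>/<ex>.
    import xml.etree.ElementTree as ET

    TEI = "http://www.tei-c.org/ns/1.0"
    fragment = (
        f'<l xmlns="{TEI}">lorde <choice>'
        f'<abbr>w<am>t</am></abbr>'
        f'<expan>w<ex>i</ex>t<ex>h</ex></expan>'
        f'</choice> vs</l>'
    )
    line = ET.fromstring(fragment)

    def render(elem, view):
        """Flatten mixed content, keeping <abbr> in the diplomatic view and <expan> in the reading view."""
        drop = f"{{{TEI}}}expan" if view == "diplomatic" else f"{{{TEI}}}abbr"
        parts = [elem.text or ""]
        for child in elem:
            if child.tag != drop:
                parts.append(render(child, view))
            parts.append(child.tail or "")
        return "".join(parts)

    print(render(line, "diplomatic"))   # "lorde wt vs"  (abbreviation with its marker)
    print(render(line, "reading"))      # "lorde with vs" (supplied letters expanded)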
Another of the issues I was interested in exploring in this edition was the use of stand-off markup to create interoperable secondary resources, and how this might affect the nature of scholarly editing. While I could have stored much of the information I needed in the edition itself, I wanted to experiment with storing it in external files and linking them together by pointing into the digital objects. The motivation for this comes from a desire to explore notions of interoperability, since stand-off markup methodology usually leaves the base text untouched and stores additional information in separate files. As greater numbers of good scholarly academic resources increasingly become available in XML, pointing into a number of resources and combining these together to form an additional, greater resource is becoming more common. Stand-off markup was used here partly to experiment with the idea of creating an interoperable, flexible resource, that is, an ‘agile edition’. For example, an edition can be combined with associated images, a glossary or word list, or other external resources such as dictionaries. In the case of this edition, I generated a word list (encoded using the TEI dictionaries module) using XSLT. The word list included any distinct orthographic variants in the edition. This was based on a ‘deep-equals’ comparison which compared not only the spelling of words but all of their descendant elements, and thus captured differences in abbreviation/expansion inside individual words. The location of individual instances of orthographic variance in the edition could easily have been stored along with the entry in the word list. However, since part of the point was to experiment with handling stand-off markup, I stored these in a third file whose only purpose was to record <link> elements pointing both to an entry in the word list and to every single instance of this word in the edition. This linking was done using automatically generated xml:id attributes on each word and word list entry. This enables a number of usability features. Clicking on any individual word in the edition takes you to its corresponding entry in the word list. From any entry in the word list you can similarly get back to any other individual instance of that word in the edition. Moreover, the word list entry also contains an optionally displayed concordance of that word to allow easy comparison of its use in context. In addition to using resources created by myself, it was a desired aim of this investigation into stand-off markup to use external resources. The most appropriate freely available resource in this case is the Middle English Dictionary (MED), created by the University of Michigan. As this scholarly edition was being created in my spare time, I did not want to exhaustively check orthographic words in my edition against the MED and link directly to the correct senses. While that is certainly possible, and should be the recommended text-critical practice, it would take a significant amount of time and be prone to error. Instead, I wanted to pass a regularised form of the word to the MED headword search engine, retrieve the results, and incorporate them dynamically into the display of the entry for that word in the word list.
However, this proved impossible to do from the MED website because their output, despite claiming to be XHTML, was not well-formed. Luckily, they were willing to supply me with an underlying XML file which provided not only headwords, but also their various different orthographic forms and the MED id numbers to which I could link directly. Thus, I was able to achieve the same effect as transcluding the MED search results by reimplementing the functionality of their search directly in my XSLT, and thus providing pre-generated links in each entry to possible headwords in the MED. While successful for my resource, in terms of true interoperability it is really a failure, one which helps to highlight some of the problems encountered when pointing into resources over which you have no control. The proposed paper will describe the process of creation of the edition, the benefits and drawbacks of using stand-off markup in this manner, its linking to external resources, and how the same processes might be used in either legacy data migration or the creation of new editions. One of the concluding arguments of the paper is that the advent of new technologies, which makes the long-promised interoperability of resources that much easier, also encourages (and is dependent upon) our making our own existing materials accessible in a compatible manner.

Bibliography
Baker, Donald C., John L. Murphy, and Louis B. Hall, Jr. (eds) Late Medieval Religious Plays of Bodleian MSS Digby 133 and E Museo 160, EETS, 283 (Oxford: Oxford Univ. Press, 1982).
Eggert, Paul. ‘Text-encoding, theories of the text and the “work-site”’. Literary and Linguistic Computing, (2005), 20:4, 425-435.
Furnivall, F. J. (ed.) The Digby Plays, (New Shakespeare Society Publications, 1882). Re-issued for EETS Extra Series LXX, (London: EETS, 1896).
Robinson, P. ‘The one and the many text’, Literary and Linguistic Computing, (2000), 15:1, 5-14.
Robinson, P. ‘Where We Are with Electronic Scholarly Editions, and Where We Want to Be’, Jahrbuch für Computerphilologie, 5 (2003), 123-143.
Robinson, P. ‘Current issues in making digital editions of medieval texts or, do electronic scholarly editions have a future?’, Digital Medievalist, (2005), 1:1. Retrieved 1 Nov. 2007 <http://www.digitalmedievalist.org/journal/1.1/robinson/>.
Sperberg-McQueen, C. M. Textual Criticism and the Text Encoding Initiative. Annual Convention of the Modern Language Association, 1994. Reprinted in Finneran, R. J. (ed.) (1996) The Literary Text in the Digital Age. Ann Arbor: University of Michigan Press. 37-62.
TEI Consortium, eds. TEI P5: Guidelines for Electronic Text Encoding and Interchange. Retrieved 1 Nov. 2007 <http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html>.
Unsworth, J. Electronic Textual Editing and the TEI. Annual Convention of the Modern Language Association, 2002. Retrieved 1 Nov. 2007 <http://www3.isrl.uiuc.edu/~unsworth/mla-cse.2002.html>.
Vanhoutte, E. ‘Prose fiction and modern manuscripts: limitations and possibilities of text-encoding for electronic editions’. In J. Unsworth, K. O’Keefe, and L. Burnard (eds). Electronic Textual Editing. (New York: Modern Language Association of America, 2006), p. 161-180.

ENRICHing Manuscript Descriptions with TEI P5
James C. Cummings James.Cummings@oucs.ox.ac.uk University of Oxford, UK

ENRICH is a pan-European eContent+ project led by the Czech National Library, which started on 1 December 2007 and is creating a base for the European digital library of cultural heritage (manuscripts, incunabula, early printed books, and archival papers) by integrating existing but scattered electronic content within the Manuscriptorium digital library, through metadata enrichment and coordination between heterogeneous metadata and data standards. The consortium brings together a critical mass of content: the project groups together the three richest owners of digitized manuscripts among national libraries in Europe (the Czech Republic, Iceland, and Serbia as an associated partner). The ENRICH partner libraries possess almost 85% of the currently digitized manuscripts in the national libraries of Europe, and this will be enhanced by a substantial amount of data from university libraries and other types of institutions, amounting to several hundred thousand digitized pages, plus hundreds of thousands of pages digitized by the partners through the European Library data collections. When the project has finished, the consortium will make available an estimated total of more than 5,076,000 digitized pages. ENRICH <http://enrich.manuscriptorium.com/> builds upon the existing Manuscriptorium platform <http://www.manuscriptorium.com/> and is adapting it to the needs of those organizations holding repositories of manuscripts. The principle of integration is centralisation of the metadata (descriptive evidence records) within the Manuscriptorium digital library and distribution of the data (other connected digital documents) among other resources within the virtual network environment. That is, the metadata of the manuscript descriptions is aggregated in a single place and format to assist with searching these disparate repositories, but the images and original metadata records remain with the resource-holding institutions. The ENRICH target groups are content owners/holders, libraries, museums and archives, researchers and students, policy makers and general-interest users. The project allows them to search and access documents which would otherwise be hardly accessible, by providing free access to almost all digitized manuscripts in Europe. Besides images it also offers access to TEI-structured historical full texts, research resources, other types of illustrative data (audio and video files), and large images of historical maps. The ENRICH consortium is closely cooperating with TEL (The European Library) and will become a component part of the European Digital Library when this becomes a reality. Institutions involved in the ENRICH project include:
• National Library of the Czech Republic (Czech Republic)
• Cross Czech a.s. (Czech Republic)
• AiP Beroun s r.o. (Czech Republic)
• Oxford University Computing Services (United Kingdom)
• Københavns Universitet - Nordisk Forskningsinstitut (Denmark)
• Biblioteca Nazionale Centrale di Firenze - National Library in Florence (Italy)
• Università degli Studi di Firenze - Centro per la comunicazione e l’integrazione dei media / Media Integration and Communication Centre Firenze (Italy)
• Institute of Mathematics and Informatics (Lithuania)
• University Library Vilnius (Lithuania)
• SYSTRAN S.A. (France)
• University Library Wroclaw (Poland)
• Stofnun Árna Magnússonar í íslenskum fræðum (Iceland)
• Computer Science for the Humanities - Universität zu Köln (Germany)
• St. Pölten Diocese Archive (Austria)
• The National and University Library of Iceland (Iceland)
• Biblioteca Nacional de España - The National Library of Spain (Spain)
• The Budapest University of Technology and Economics (Hungary)
• Poznan Supercomputing and Networking Center (Poland)
Manuscriptorium is currently searchable via OAI-PMH from the TEL portal, which means that any ENRICH full or associated partner automatically enriches the European Digital Library. The main quality of ENRICH and Manuscriptorium is the provision of homogeneous and seamless access to widespread resources, including access to images from the distributed environment under a single common interface. ENRICH supports several levels of communication with remote digital resources, ranging from OAI harvesting of partner libraries to full integration of their resources into Manuscriptorium. Furthermore, ENRICH has developed free tools to assist in producing XML-structured digitized historical documents, and these are already available and ready to use. However, the existing XML schema reflects the results of the now-dated EU MASTER project for TEI-based manuscript descriptions. It also incorporates other modern content standards, especially in the imaging area. It is this schema that is being updated to reflect the developments in manuscript description available in TEI P5. The internal Manuscriptorium format is based on METS containerization of the schema and related parallel descriptions, which enables the flexible approach needed by the disparate practices of researchers in this field. The Research Technology Services section of the Oxford University Computing Services is leading the crucial work package on the standardization of shared metadata. In addition to a general introduction to the project, it is the progress of this work package which the proposed paper will discuss. This work package is creating a formal TEI specification, based on the TEI P5 Guidelines, for the schema to be used to describe manuscripts managed within Manuscriptorium. The use of this specification enables the automatic generation of reference documentation in different languages as well as the creation of formal DTDs or schemas. A suite of tools is also being developed to convert existing sets of manuscript descriptions automatically, where this is feasible, and to provide simple methods of making them conformant to the new standard where it is not. These tools are being developed and validated against the large existing base of adopters of the MASTER standard and will be distributed free of charge by the TEI. The proposed paper will report on the development and validation of a TEI-conformant specification for the existing Manuscriptorium schema using the TEI P5 specification language (ODD). This involved a detailed examination of the current schema and documentation developed for the existing Manuscriptorium repository and its replacement by a formal TEI specification.
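A minimal sketch of the kind of batch conversion such tools perform, renaming elements from an older schema to their newer equivalents, follows; the mapping table and sample record are invented for illustration and are not the actual MASTER-to-P5 correspondence used by the project.

    # Minimal sketch of schema-to-schema element renaming (illustrative mapping only).
    import xml.etree.ElementTree as ET

    RENAME = {"msDescription": "msDesc", "msHeading": "head"}   # hypothetical mapping

    def convert(elem):
        for child in elem:
            convert(child)
        if elem.tag in RENAME:
            elem.tag = RENAME[elem.tag]

    record = ET.fromstring("<msDescription><msHeading>A manuscript</msHeading></msDescription>")
    convert(record)
    print(ET.tostring(record, encoding="unicode"))   # <msDesc><head>A manuscript</head></msDesc>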
This specification will continue to be further enhanced in light of the needs identified by project participants and the wider MASTER community, to form the basis of a new schema and documentation suite. The paper will report on the project’s production of translations of the documentation into at least English, French, Czech and German, as well as schemas to implement it both as a DTD and as RELAX NG. These schemas and the translated documentation are produced automatically from the TEI ODD file. A meeting of representatives from other European institutions who have previously used the MASTER schema has been organized, at which the differences in how they have used the schema, along with their current practice for manuscript description, will have been explored. The aim is to validate both the coverage of the new specification and the feasibility and ease of automatic conversion towards it. The outputs from this activity will include a report on any significant divergence of practice amongst the sample data sets investigated. The ODD specification will continue to be revised as necessary based on the knowledge gained from the consultation with other MASTER users. This will help to create an enriched version of the preliminary specifications produced. Finally, software tools are being developed to assist in the conversion of sets of records produced for earlier MASTER specifications, and perhaps some others, to the new TEI P5 conformant schema. These tools are being tested against the collection of datasets gained from the participants of the meeting with other MASTER users, but also more widely within the TEI community. OUCS is also preparing tutorial material and discussion papers on best practice to assist other institutions with migrating existing MASTER material to the new standard. In this subtask ENRICH is cooperating with a broader TEI-managed effort towards the creation of TEI P5 migration documentation and resources. An OAI-PMH harvester is being implemented and incorporated into Manuscriptorium. The first step is to ensure that OAI-PMH metadata is available for harvesting from all the resources managed within Manuscriptorium. Appropriate software tools to perform this harvesting are also being developed. Eventually, the internal environment of Manuscriptorium will be enhanced through implementation of METS containerization of the Manuscriptorium Scheme. This will involve an assessment of the respective roles of the TEI Header for manuscript description and of a METS-conformant resource description, and will enable different kinds of access to the resources within the Manuscriptorium. This will help to demonstrate for others the interoperability of these two important standards, and in particular where their facilities are complementary. Improvement and generalization of Unicode treatment in Manuscriptorium is the final part of the OUCS-led work package. As Manuscriptorium is basically an XML system, all the data managed is necessarily represented in Unicode. This could cause problems for materials using non-standard character encodings, for example where manuscript descriptions quote from ancient scripts and include glyphs not yet part of Unicode. The TEI recommendations for the representation of non-standard scripts are being used within the ENRICH project, which is producing a suite of non-standard character and glyph descriptions appropriate to the project’s needs. The proposed paper is intended as a report on the work done in the conversion and rationalization of manuscript metadata across a large number of archives with disparate practices.
While it will introduce the project to delegates at Digital Humanities 2008, it will concentrate on reporting the problems and successes encountered in the course of these aspects of the project. Although the overall project will not be finished by the time of the conference, the majority of the work in developing a suite of conversion tools will be complete by then, and the paper will focus on this work. As such, although it will detail work done by the author, it will rely on work by project partners who, although not listed as authors here, will be briefly acknowledged in the paper where appropriate.

Editio ex machina - Digital Scholarly Editions out of the Box
Alexander Czmiel alexander@czmiel.de Berlin-Brandenburg Academy of Sciences and Humanities, Germany

For the most part, digital scholarly editions have been historically grown constructs. In most cases they are oriented toward print editions, especially if they are retro-digitized. But even “digital-born” editions often follow the same conventions as printed books. A further problem is that editors start to work on a scholarly edition without enough previous in-depth analysis of how to structure information for electronic research. In the course of editing and the collection of data, the requirements and wishes for how the edited texts should be presented frequently change, especially when the editor does not have a chance to “see” the edition before it is complete. Usually all the data is collected, the text is edited, and only the last step is to think about the different presentation formats. One of the first steps in the production of a digital scholarly edition should be to analyze what kind of information might be of interest to a broad audience, and how this should be structured so that it can be searched and displayed appropriately. The crucial point in the process of designing the data structure should be that different scholars have different intellectual requirements of resources. They are not always happy with how editors organize scholarly editions. Michael Sperberg-McQueen demanded in 1994 that “Electronic scholarly editions should be accessible to the broadest audience possible.”[1] However, current scholarly editions are produced for a certain audience with very specialized research interests. The great potential of a digital scholarly edition is that it can be designed flexibly enough to meet the demands of people with many different viewpoints and interests. So why not let the audience make the decision as to which information is relevant and which is not? Some information might be of more interest, other information of less interest, or could be entirely ignorable. Imagine a critical scholarly edition about medicine in ancient Greece provided by editors with a philological background. A philologist has different needs relating to this edition than, e.g., a medical historian, who might not be able to read Greek and might only be interested in reading about ancient medical practices. Who knows what kinds of annotations within the text are of interest to the recipients?
The digital world provides us with many possibilities to improve scholarly editions, such as almost unlimited space to give complete information, more flexibility in organizing and presenting information, querying information, instant feedback and a range of other features. We have to think about how we can use these benefits to establish a (perhaps formalized) workflow to give scholars the chance to validate their work while it is in progress and to present their work in progress in an appropriate way for discussion within a community. This would not just validate the technical form of the work, but also improve the quality of the content, often due to ongoing feedback, and improve the intellectual benefits. On the other hand, digital editions should be brought online in an easy way without weeks of planning and months of developing software that may fit the basic needs of the editors but in most cases is just re-inventing the wheel. If needed, digitized images should be included easily and must be citeable. As it might be about work in progress, all stages of work must be saved and documented. Therefore, a versioning system is needed that allows referencing of all work steps. Finally, it is necessary that the scholar is able to check his or her work viewed in various media by the click of a button - for example, how the edition looks like on screen or printed, or even with different layouts or website designs. What is the potential of such a system? It offers an easy way to present the current state of a work in progress. Scholars can check their work at any time. But this is not useful without modifications for the special needs of certain digital editions and special navigation issues. Nevertheless, it can be used as base system extensible by own scripts, which implement the needs of a concrete project. And last but not least, such a system should offer the possibility of re-using data that is based on the same standard for other projects. This is especially true for more formalized data, such as biographical or bibliographical information, which could be used across different projects that concern the same issue or the same period. This paper will give a practical example of what an architecture that follows the aforementioned demands would look like. This architecture gives scholars the possibility of producing a scholarly edition using open standards. The texts can be encoded directly in XML or using WYSIWYG-like methods, such as possible with the oXygen XML editor or WORD XML exports. The “Scalable Architecture for Digital Editions” (SADE) developed at the Berlin-Brandenburg Academy of Sciences and Humanities is a modular and freely scalable system that can be adapted by different projects that follow the guidelines of the Text Encoding Initiative (TEI)[2] or easily use other XML standards as input formats. Using the TEI is more convenient, as less work in the modification of the existing XQuery and XSLT scripts needs to be done. _____________________________________________________________________________ 101 Digital Humanities 2008 _____________________________________________________________________________ Scalability of SADE relates to the server side as well as to the browser side. Browser-side scalability is equivalent to the users’ needs or research interests. The user is able to arrange the information output as he or she likes. Information can be switched on or off.The technological base for this functionality is AJAX or the so-called Web 2.0 technologies. 
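To illustrate how the browser side and the server side meet, the following Python sketch issues the kind of request an AJAX component might send to an XML database behind an edition. The endpoint, collection path and query are assumptions made for the example (they follow eXist's commonly documented REST conventions) and are not taken from the SADE paper.

import urllib.parse, urllib.request

XQUERY = ('declare namespace tei="http://www.tei-c.org/ns/1.0"; '
          'for $n in collection("/db/edition")//tei:persName return normalize-space($n)')
url = "http://localhost:8080/exist/rest/db/edition?_query=" + urllib.parse.quote(XQUERY)
with urllib.request.urlopen(url) as response:       # the server returns an XML result envelope
    print(response.read().decode("utf-8"))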
Server-side scalability is everything that has to do with querying the database and transforming the query results into HTML or PDF. As eXist[3], the database we use, is a native XML database, the whole work can be done with XML-related technologies such as XQuery and XSLT. These scripts can be adapted with little effort to most projects' needs. For the connection between text and digitized facsimile, SADE uses Digilib[4], an open-source software tool jointly developed by the Max Planck Institute for the History of Science, the University of Bern and others. Digilib is not just a tool for displaying images, but rather a tool that provides basic image editing functions and the capability of marking certain points in the image for citation. The versioning system is at the moment still a work in progress, but will be available by conference time. Documents can be queried in several ways - on the one hand with a standard full-text search in texts written in Latin or Greek letters, and on the other hand by using a more complex interface to query structured elements, such as paragraphs, names, places, the apparatus, etc. These query options are provided by SADE. Furthermore, all XML documents are also available unrendered, in raw XML format, and so can be integrated into different projects and rendered in a different way. SADE could be the next step in improving the acceptance of digital editions. Texts are accessible in several ways. The editor's decisions are transparent and comprehensible at every stage of work, which is most important for the intellectual integrity of a scholarly edition. The digitized facsimiles can be referenced and therefore be discussed scientifically. The database back end is easy to handle and easy to adapt to most projects' needs. A working example, which is work in progress and extended continually, is the website of the "Corpus Medicorum Graecorum / Latinorum"[5] long-term academy project at the Berlin-Brandenburg Academy of Sciences and Humanities[6]. This website exemplifies how a recipient can determine which information is relevant for his or her information retrieval. Other examples can be found at http://pom.bbaw.de.

References
[1] Cited in http://www.iath.virginia.edu/~jmu2m/mla-cse.2002.html
[2] http://www.tei-c.org/index.xml
[3] http://www.exist-db.org/
[4] http://digilib.berlios.de/
[5] http://pom.bbaw.de/cmg/
[6] http://www.bbaw.de/

Bibliography
Burnard, L.; Bauman, S. (eds.) (2007): TEI P5: Guidelines for Electronic Text Encoding and Interchange. <http://www.tei-c.org/release/doc/tei-p5-doc/en/html/index.html>
Buzzetti, Dino; McGann, Jerome (2005): Critical Editing in a Digital Horizon. In: John Unsworth, Katharine O'Brien O'Keeffe and Lou Burnard (eds.): Electronic Textual Editing. 37-50.
Czmiel, A.; Fritze, C.; Neumann, G. (2007): Mehr XML - Die digitalen Projekte an der Berlin-Brandenburgischen Akademie der Wissenschaften. In: Jahrbuch für Computerphilologie - online, 101-125. <http://computerphilologie.tu-darmstadt.de/jg06/czfrineu.html>
Dahlström, Mats (2004): How Reproductive is a Scholarly Edition? In: Literary and Linguistic Computing 19/1, pp. 17-33.
Robinson, Peter (2002): What is a Critical Digital Edition? In: Variants - Journal of the European Society for Textual Scholarship 1, 43-62.
Shillingsburg, Peter L. (1996): Principles for Electronic Archives, Scholarly Editions, and Tutorials. In: Richard Finneran (ed.): The Literary Text in the Digital Age. 23-35.
_____________________________________________________________________________ 102 Digital Humanities 2008 _____________________________________________________________________________ Talia: a Research and Publishing Environment for Philosophy Scholars Stefano David sdavid@delicias.dia.fi.upm.es DEIT - Università Politecnica delle Marche, Italy Michele Nucci mik.nucci@gmail.com DEIT - Università Politecnica delle Marche, Italy Francesco Piazza f.piazza@univpm.it DEIT - Università Politecnica delle Marche, Italy Introduction Talia is a distributed semantic digital library and publishing system, which is specifically designed for the needs of the scholarly research in the humanities. The system is designed for the diverging needs of heterogeneous communities. It is strictly based on Semantic Web technology1 and computational ontologies for the organisation of knowledge and will help the definition of a state-ofthe-art research and publishing environment for the humanities in general and for philosophy in particular. Talia is developed within the Discovery project, which aims at easing the work of philosophy scholars and researchers, making a federation of distributed digital archives (nodes) each of them dedicated to a specific philosopher: Pre-Socratic, Diogenes Laertius, Leibniz, Nietzsche, Wittgenstein, Spinoza, and others. Use Case and Requirements One of the use cases that Talia should satisfy is called The Manuscripts Comparer.The archive contains digital reproductions of handwritten manuscripts by one philosopher. The scholar wants to browse them like books, navigating by chapters and pages.The archive also contains transcriptions of the different paragraphs, on which to click to see the correct transcription. Moreover, the scholar is interested in comparing different versions of the same thought, placed by the philosopher in different paragraphs of his works, so the scholar needs to find all those different versions. Finally, he/she can write his own remarks on those paragraphs, so that they are published in the archive, after a peer review process. From this use case, we can devise some technical requirements that drive the design of Talia: 1. Making metadata remotely available. The metadata content of the archives shall be made available through interfaces similar to those of URIQA2, which will allow automated agents and clients to connect to a Talia node and ask for metadata about a resource of interest to retrieve its description and related metadata (e.g., the author, the resource type, etc.). This type of services enable numerous types of digital contents reuse. 2. Querying the system using standard query-languages. It shall possible to directly query the distributed archives using the SPARQL Semantic Web Query Language. SPARQL provides a powerful and standard way to query RDF repositories but also enables merging of remote datasets. 3. Transformation of encoded digital contents into RDF. The semantic software layer shall provide tools to transform textually encoded material, for example the transcription of manuscripts in TEI3 format, into RDF. 4. Managing structural metadata. “Structural” metadata describes the structure of a document (e.g., its format, its publisher, etc.). It shall be necessary to provide tools to manage these kinds of metadata. 5. Import and exporting RDF data. It shall be important to include facilities in Talia to import and export RDF data with standard and well know formats like RDF/XML and N-Triples. 
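As a toy illustration of requirements 2 and 5 above (and not of the Talia code base itself), the following Python sketch uses rdflib to load a small RDF description of two sources, query it with SPARQL, and export it as N-Triples; the resource URIs are invented for the example.

from rdflib import Graph

DATA = """
@prefix dct: <http://purl.org/dc/terms/> .
<http://example.org/source/ms-1> dct:creator "Nietzsche" ; dct:type "manuscript" .
<http://example.org/source/ms-2> dct:creator "Leibniz" ; dct:type "manuscript" .
"""

g = Graph()
g.parse(data=DATA, format="turtle")

q = """PREFIX dct: <http://purl.org/dc/terms/>
SELECT ?src WHERE { ?src dct:creator "Nietzsche" . }"""
for row in g.query(q):
    print(row.src)                      # URIs of matching sources

print(g.serialize(format="nt"))         # export as N-Triples (requirement 5)

Requirement 1 would then amount to exposing such a description for each resource URI on request.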
Talia should also provide facilities to enrich the content of the archives with metadata, whose types and formats depend on the kind of source and on the user community which works on the data. For example, in some archives, different versions of the same paragraph may exist in different documents. In order to allow the users to follow the evolution of that particular philosophical thought, the relations among these different versions must be captured by the system. An Overview of the Talia System Talia is a distributed semantic digital library system which combines the features of digital archives management with an on-line peer-review system. The Talia platform stores digital objects, called sources, which are identified by their unique and stable URI. Each source represents either a work of a philosopher or a fragment of it (e.g., a paragraph), and can have one or more data files (e.g., images or textual documents) connected to it.The system also stores information about the sources, which can never be removed once published and are maintained in a fixed state. Combined with other long-term preservation techniques,Talia allows the scholars to reference their works and gives the research community immediate access to new content. _____________________________________________________________________________ 103 Digital Humanities 2008 _____________________________________________________________________________ The most innovative aspect of Talia is that for the first time, at least in the field of humanities, all the interfaces exposed publicly will be based on proven Semantic Web standards enabling data interoperability within the Talia federation, and eases the data interchange with other systems. Two additional features of Talia are the highly customisable Graphic User Interface (GUI) and the dynamic contextualisation. Intuitively, a computational ontology is a set of assumptions that define the structure of a given domain of interest (e.g., philosophy), allowing different people to use the same concepts to describe that domain.Talia will use computational ontologies to organise information about writings of a philosopher or documents (manuscripts, essays, theses, and so on) of authors concerning that philosopher. Related Work Figure 1: An Example of the Talia User Interface. The web interface framework, based on modular elements called widgets, allows to build GUIs according to requirements coming from heterogeneous communities. Widgets can be packaged independently and used as building blocks for the application’s user interface. In particular, Talia provides semantic widgets that interact with an RDF Knowledge Base. To customise the site’s appearance, it also would be possible to add custom HTML rendering templates. Figure 1 shows a screenshot of a Talia’s GUI. The dynamic contextualisation provides a means for data exchange among different and distributed digital libraries based on the Talia framework. The dynamic contextualisation allows a Talia library (node) to share parts of its RDF graph in a peer-to-peer network. By using this mechanism, a node may notify another node that a semantic link between them exists, and the other node may use this information to update its own RDF graph and create a bidirectional connection. Talia uses computational ontologies and Semantic Web technology to help the creation of a state-of-the-art research and publishing environment. 
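Reduced to its simplest mechanical step, the dynamic contextualisation described above means that a node adds the statements a peer has shared about a common resource to its own RDF graph. A hypothetical sketch, again with rdflib; the actual Talia exchange protocol is not shown here.

from rdflib import Graph

local = Graph()
remote_fragment = Graph()
remote_fragment.parse(data="""
@prefix ex: <http://example.org/vocab/> .
<http://example.org/source/ms-1> ex:relatedTo <http://example.org/source/ms-9> .
""", format="turtle")

for triple in remote_fragment:   # merge the peer's shared statements into the local graph
    local.add(triple)
print(len(local))                # one new candidate link now known locally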
Talia will provide an innovative and adaptable system to enable and ease data interoperability and new paradigms for information enrichment, data retrieval, and navigation.

Computational Ontologies
Ontologies have become popular in computer science as a means for the organisation of information. This connotation of ontology differs from the traditional use and meaning it has in philosophy, where ontologies are considered as "a system of categories accounting for a certain vision of the world" [2]. In computer science, the concept of (computational) ontology evolved from the one first provided by Gruber [3], who defined an ontology as "a specification of a conceptualisation" (note 4), to a more precise one, extracted from Guarino's definitions [4]: a computational ontology is "a formal, partial specification of a shared conceptualisation of a world (domain)". Talia is directly related to the Hyper Platform which was used for the HyperNietzsche archive [1], a specialised solution designed for the specific needs of the Nietzsche communities. HyperNietzsche has a fixed graphical user interface, it does not use Semantic Web technology, and it is not adaptable for different communities with heterogeneous needs. Talia shares some properties with other semantic digital library systems like JeromeDL [5], BRICKS [6], and Fedora [7]. However, these projects are mostly focused on the backend technology, and none of them offers a flexible and highly customisable research and publishing system like Talia.

Conclusion and Future Work
Talia is a novel semantic digital web library system which aims to improve scholarly research in the humanities, in particular in the field of philosophy. Thanks to the use of Semantic Web technology, Talia represents a very adaptable state-of-the-art research and publishing system. The ontologies used in Talia are currently being developed by Discovery's content partners, who are in charge of organising their contributions. These ontologies will be part of the final core Talia application. At the moment, only a first public demo version is available (note 5). Although Talia is currently intended only for philosophy scholars, it should be straightforward to adapt it for the humanities more broadly, with the help of suitably developed ontologies. Using dynamic Ruby as the programming language and the RubyOnRails framework, Talia provides an ideal framework for the rapid development of customised semantic digital libraries for the humanities.

Acknowledgements
This work has been supported by Discovery, an ECP 2005 CULT 038206 project under the EC eContentplus programme. We thank Daniel Hahn and Michele Barbera for their fruitful contributions to this paper.

Notes
1 http://www.w3.org/2001/sw/
2 http://sw.nokia.com/uriqa/URIQA.html
3 http://www.tei-c.org/
4 A more up-to-date definition and additional readings can be found at http://tomgruber.org/writing/ontology-definition-2007.htm
5 http://demo.talia.discovery-project.eu

References
[1] D'Iorio, P.: Nietzsche on new paths: The Hypernietzsche project and open scholarship on the web. In Fornari, C., Franzese, S., eds.: Friedrich Nietzsche. Edizioni e interpretazioni. Edizioni ETS, Pisa (2007)
[2] Calvanese, D., Guarino, N.: Ontologies and Description Logics. Intelligenza Artificiale - The Journal of the Italian Association for Artificial Intelligence, 3(1/2) (2006)
[3] Gruber, T.R.: A translation approach to portable ontology specifications. Knowledge Acquisition 5(2) (1993) 199-220
[4] Guarino, N.: Formal Ontology and Information Systems. In Guarino, N., ed.: 1st International Conference on Formal Ontologies in Information Systems, IOS Press (1998) 3-15
[5] Kurk S., Woroniecki T., Gzella A., Dabrowski M., McDaniel B.: Anatomy of a social semantic library. In: European Semantic Web Conference. Volume Semantic Digital Library Tutorial. (2007)
[6] Risse T., Knezevic P., Meghini C., Hecht R., Basile F.: The BRICKS infrastructure - an overview. In: The International Conference EVA, Moscow, 2005.
[7] Fedora Development Team: Fedora Open Source Repository software. White Paper.

Bootstrapping Classical Greek Morphology
Helma Dik
helmadik@mac.com
University of Chicago, USA
Richard Whaling
rwhaling@uchicago.edu
University of Chicago, USA

In this paper we report on an incremental approach to automated tagging of Greek morphology using a range of already existing tools and data. We describe how we engineered a system that combines the many freely available resources into a useful whole for the purpose of building a searchable database of morphologically tagged classical Greek. The current state of the art in electronic tools for classical Greek morphology is represented by Morpheus, the morphological analyzer developed by Gregory Crane (Crane 1991). It provides all possible parses for a given surface form, and the lemmas from which these are derived. The rich morphology of Greek, however, results in multiple parses for more than 50% of the words (http://grade-devel.uchicago.edu/morphstats.html). There are fully tagged corpora available for pre-classical Greek (early Greek epic, developed for the 'Chicago Homer', http://www.library.northwestern.edu/homer/) and for New Testament Greek, but not for the classical period. Disambiguating more than half of our 3 million-word corpus by hand is not feasible, so we turned to other methods. The central element of our approach has been Helmut Schmid's TreeTagger (Schmid 1994, 1995). TreeTagger is a Markov-model based morphological tagger that has been successfully applied to a wide variety of languages. Given training data of 20,000 words and a lexicon of 350,000 words, TreeTagger achieved accuracy on a German news corpus of 97.5%. When TreeTagger encounters a form, it will look it up in three places: first, it has a lexicon of known forms and their tags. Second, it builds from that lexicon a suffix and prefix lexicon that attempts to serve as a morphology of the language, so as to parse unknown words. In the absence of a known form or recognizable suffix or prefix, or when there are multiple ambiguous parses, it will estimate the tag probabilistically, based on the tags of the previous n (typically two) words; this stochastic model of syntax is stored as a decision tree extracted from the tagged training data. Since classical Greek presents, prima facie, more of a challenge than German, given that it has a richer morphology and is a non-projective language with a complex syntax, we were initially unsure whether a Markov model would be capable of performing on Greek to any degree of accuracy. A particular complicating factor for Greek is the very large tagset: our full lexicon contains more than 1,400 tags, making it difficult for TreeTagger to build a decision tree from small datasets.
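The lexicon TreeTagger consults is essentially a list of surface forms with their possible tag-lemma pairs, which can be generated from Morpheus-style analyses. The Python sketch below assumes a tab-separated layout of form followed by tag/lemma pairs, which should be checked against the TreeTagger documentation; the forms and tags shown are invented for illustration.

analyses = {
    "logos": [("n-s---mn-", "logos")],                       # unambiguous form
    "luei":  [("v3spia---", "luo"), ("v2spie---", "luo")],    # ambiguous form
}

with open("greek-lexicon.txt", "w", encoding="utf-8") as out:
    for form, parses in sorted(analyses.items()):
        pairs = "\t".join("%s %s" % (tag, lemma) for tag, lemma in parses)
        out.write("%s\t%s\n" % (form, pairs))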
Czech (Hajič 1998) is comparable in the number of tags, but has lower rates of non-projectivity (compare Bamman and Crane 2006:72 on Latin). Thus, for a first experiment, we built a comprehensive lexicon consisting of all surface forms occurring in Homer and Hesiod, annotated with the parses occurring in the hand-disambiguated corpus (a subset of all grammatically possible parses), so that TreeTagger only had about 740 different possible tags to consider. Given this comprehensive lexicon and the Iliad and Odyssey as training data (200,000 words), we achieved 96.6% accuracy for Hesiod and the Homeric Hymns (see http://grade-devel.uchicago.edu/tagging.html). The experiment established that a trigram Markov model was in fact capable of modeling Greek grammar remarkably well. The good results can be attributed in part to the formulaic nature of epic poetry and the large size of the training data, but they established the excellent potential of TreeTagger for Greek. This high degree of accuracy compares well with state-of-the-art taggers for such disparate languages as Arabic, Korean, and Czech (Smith et al., 2005). Unfortunately, the Homeric data form a corpus that is of little use for classical Greek. In order to start analyzing classical Greek, we therefore used a hand-tagged Greek New Testament as our training data (160,000 words). New Testament Greek postdates the classical period by some four hundred years, and, not surprisingly, our initial accuracy on a 2,000-word sample of Lysias (4th century BCE oratory) was only 84% for morphological tagging, and performance on lemmas was weak. Computational linguists are familiar with the statistic that turning to out-of-domain data results in a ten percent loss of accuracy, so this result was not entirely unexpected. At this point one could have decided to hand-tag an appropriate classical corpus and discard the out-of-domain data. Instead, we decided to integrate the output of Morpheus, thereby drastically raising the number of recognized forms and possible parses. While we had found that Morpheus alone produced too many ambiguous results to be practical as a parser, as a lexical resource for TreeTagger it is exemplary. TreeTagger's accuracy on the Lysias sample rose to 88%, with much improved recognition of lemmas. Certain common Attic constructs, unfortunately, were missed wholesale, but the decision tree from the New Testament demonstrated a grasp of the fundamentals. While we are also working on improving accuracy by further refining the tagging system, so far we have seen the most prospects for improvement in augmenting our New Testament data with samples from classical Greek: when trained on our Lysias sample alone, TreeTagger performed at 96.8% accuracy when tested on that same text, but only performed at 88% on a new sample. In other words, 2,000 words of in-domain data performed no better or worse than 150,000 words of Biblical Greek combined with the Morpheus lexicon. We next used a combined training set of the tagged New Testament and the hand-tagged Lysias sample. In this case, TreeTagger was capable of augmenting the basic decision tree it had already extracted from the NT alone with Attic-specific constructions.
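Accuracy figures such as those reported here are simply the share of tokens whose predicted tag matches the hand-disambiguated gold tag, as in this minimal sketch (the tags shown are placeholders, not project data):

def accuracy(predicted, gold):
    matches = sum(1 for p, g in zip(predicted, gold) if p == g)
    return matches / len(gold)

print(accuracy(["n-s---mn-", "v3spia---"], ["n-s---mn-", "v3spie---"]))   # 0.5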
Ironically, this system only performed at 96.2% when turned back on the training data, but achieved 91% accuracy on the new sample (see http://grade-devel.uchicago.edu/Lys2.html for results on the second sample). This is a substantial improvement given the addition of only 2,000 words of text, or less than 2% of the total training corpus. In the longer term, we aim at hand-disambiguating 40,000 words, double that of Schmid (1995), but comparable to Smith et al. (2005). We conclude that automated tagging of classical Greek to a high level of accuracy can be achieved with quite limited human effort toward hand-disambiguation of in-domain data, thanks to the possibility of combining existing morphological data and machine learning, which together bootstrap a highly accurate morphological analysis. In our presentation we will report on our various approaches to improving these results still further, such as using a 6th-order Markov model, enhancing the grammatical specificity of the tagset, and the results of several more iterations of our bootstrap procedure.

References
Bamman, David, and Gregory Crane (2006). The design and use of a Latin dependency treebank. In J. Hajič and J. Nivre (eds.), Proceedings of the Fifth International Workshop on Treebanks and Linguistic Theories (TLT) 2006, pp. 67-78. http://ufal.mff.cuni.cz/tlt2006/pdf/110.pdf
Crane, Gregory (1991). Generating and parsing classical Greek. Literary and Linguistic Computing, 6(4): 243-245.
Hajič, Jan (1998). Building a syntactically annotated corpus: The Prague Dependency Treebank. In Eva Hajičova (ed.), Issues of Valency and Meaning, pp. 106-132.
Schmid, Helmut (1994). Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, pp. 44-49. http://www.ims.uni-stuttgart.de/ftp/pub/corpora/tree-tagger1.pdf
Schmid, Helmut (1995). Improvements in part-of-speech tagging with an application to German. In Proceedings of the ACL SIGDAT-Workshop. http://www.ims.uni-stuttgart.de/ftp/pub/corpora/tree-tagger2.pdf
Smith, Noah A., David A. Smith, and Roy W. Tromble (2005). Context-based morphological disambiguation with random fields. In Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, pp. 475-482. http://www.cs.jhu.edu/~dasmith/sst_emnlp_2005.pdf

Mining Classical Greek Gender
Helma Dik
helmadik@mac.com
University of Chicago, USA
Richard Whaling
rwhaling@uchicago.edu
University of Chicago, USA

This paper examines gendered language in classical Greek drama. Recent years have seen the emergence of data mining in the Humanities, and questions of gender have been asked from the start. Work by Argamon and others has studied gender in English, both for the gender of authors of texts (Koppel et al. 2002, Argamon et al. 2003) and for that of characters in texts (Hota et al. 2006, 2007 on Shakespeare); Argamon et al. (in prep. a) study gender as a variable in Alexander Street Press's Black Drama corpus (gender of authors and characters) and (in prep. b) in French literature (gender of authors). Surprisingly, some of the main findings of these studies show significant overlap: female authors and characters use more personal pronouns and negations than males; male authors and characters use more determiners and quantifiers. Not only, then, do dramatic characters in Shakespeare and modern drama, like male and female authors, prove susceptible to automated analysis, but the feature sets that separate male from female characters show remarkable continuity with those that separate male from female authors. For a scholar of Greek, this overlap between actual people and dramatic characters holds great promise. Since there are barely any extant Greek texts written by women, these results for English give us some hope that the corpus of Greek drama may serve as evidence for women's language in classical Greek. After all, if the results for English-language literature showed significant differences between male and female characters, but no parallels with the differences found between male and female authors, we would be left with a study of gender characterization by dramatists with no known relation to the language of actual women. This is certainly of interest from a literary and stylistic point of view, but from a linguistic point of view, the English data hold out the promise that what we learn from Greek tragedy will tell us about gendered use of Greek language more generally, which is arguably a question of larger import, and one we have practically no other means of learning about. There is some contemporary evidence to suggest that Greek males considered the portrayal of women on stage to be true to life. Plato's Socrates advises in the Republic against actors portraying women. Imitation must lead to some aspects of these inferior beings rubbing off on the (exclusively male) actors (Republic 3.394-395). Aristophanes, the comic playwright, has Euripides boast (Frogs 949f.) that he made tragedy democratic, allowing a voice to women and slaves alongside men. Gender, of course, also continues to fascinate modern readers of these plays. For instance, Griffith (1999: 51) writes, on Antigone: "Gender lies at the root of the problems of Antigone. (...) Sophocles has created one of the most impressive female figures ever to walk the stage." Yet there are no full-scale studies of the linguistic characteristics of female speech on the Greek stage (pace McClure 1999). In this paper, we report our results on data mining for gender in Greek drama. We started with the speakers in four plays of Sophocles (Ajax, Antigone, Electra, and Trachiniae), for a total of thirty characters, in order to test, first of all, whether a small, non-lemmatized Greek drama corpus would yield any results at all. We amalgamated the text of all characters by hand into individual files per speaker and analyzed the resulting corpus with PhiloMine, the data mining extension to PhiloLogic (http://philologic.uchicago.edu/philomine). In this initial experiment, words were not lemmatized, and only occurrences of individual words, not bigrams or trigrams, were used. In spite of the modest size, results have been positive. The small corpus typically produced results of "100% correct" classification on different tasks, which is to be expected as a result of overfitting to the small amount of data. More significantly, results on cross-validation were in the 80% range, whereas results on random falsification stayed near 50%. We were aware of other work on small corpora (Yu 2007 on Dickinson), but were heartened by these positive results with PhiloMine, which had so far been used on much larger collections.
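For readers who want to reproduce the general design of such an experiment, the Python sketch below uses scikit-learn as a stand-in for PhiloMine: bag-of-words counts, a multinomial naive Bayes classifier, cross-validation, and a look at the most strongly weighted words per class. The speaker files and labels are placeholders, not the Sophocles data.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

texts = ["oimoi talaina deilaia", "pheu pheu talaina ego",        # placeholder speaker files
         "hopla kai andres kai doru", "nekros kai hopla andron"]
labels = ["female", "female", "male", "male"]                     # placeholder labels

vec = CountVectorizer()                      # unlemmatized single words, as in the experiment
X = vec.fit_transform(texts)

print(cross_val_score(MultinomialNB(), X, labels, cv=2).mean())   # cross-validated accuracy

clf = MultinomialNB().fit(X, labels)
top = np.argsort(clf.feature_log_prob_[0])[::-1][:5]              # classes_ are sorted, so index 0 is "female"
print([vec.get_feature_names_out()[i] for i in top])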
In our presentation, we will examine two questions in more depth, and on the basis of a larger corpus. First, there is the overlap found between work on English and French. Argamon et al. (2007) laid down the gauntlet: "The strong agreement between the analyses is all the more remarkable for the very different texts involved in these two studies. Argamon et al. (2003) analyzed 604 documents from the BNC spanning an array of fiction and non-fiction categories from a variety of types of works, all in Modern British English (post-1960), whereas the current study looks at longer, predominantly fictional French works from the 12th - 20th centuries. This cross-linguistic similarity could be supported with further research in additional languages." So do we find the same tendencies in Greek, and if so, are we dealing with human, or 'Western' cultural, universals? Our initial results were mixed. When we ran a multinomial naive Bayes (MNB) analysis on a balanced sample, we did indeed see some negations show up as markers for female characters (3 negations in a top 50 of 'female' features; none in the male top 50), but pronouns and determiners show up in feature sets for both the female and the male corpus. An emphatic form of the pronoun 'you' turned up as the most strongly male feature in this same analysis, and two more personal pronouns showed up in the male top ten, as against only one in the female top ten. Lexical items, on the other hand, were more intuitively distributed. Words such as 'army', 'man', 'corpse' and 'weapons' show up prominently on the male list; two past tense forms of 'die' show up in the female top ten. A larger corpus will allow us to report more fully on the distribution of function words and content words, and on how selections for frequency influence classifier results. Secondly, after expanding our corpus, regardless of whether we find similar general results for Greek as for English and French, we will also be able to report on variation among the three tragedians, and give more fine-grained analysis. For instance, in our initial sample, we categorized gods and choruses as male or female along with the other characters (there are usually indications in the text as to the gender of the chorus in a given play, say 'sailors', 'male elders', 'women of Trachis'). Given the formal requirements of the genre, we expect that it will be trivial to classify characters as 'chorus' vs. 'non-chorus', but it will be interesting to see whether gender distinctions hold up within the group of choruses, and to what extent divinities conform to gender roles. The goddess Athena was the character most often mis-classified in our initial sample; perhaps this phenomenon will be more widespread in the full corpus. Such a finding would suggest (if not for the first time, of course) that authority and gender intersect in important ways, even as early as the ancient Greeks' conceptions of their gods.

References
Argamon, S., M. Koppel, J. Fine, A. Shimoni 2003. "Gender, Genre, and Writing Style in Formal Written Texts", Text 23(3).
Argamon, S., C. Cooney, R. Horton, M. Olsen, S. Stein (in prep. a). "Gender, Race and Nationality in Black Drama."
Argamon, S., J.-B. Goulain, R. Horton, M. Olsen (in prep. b). "Vive la Différence! Text Mining Gender Difference in French Literature."
Griffith, M. (ed.), 1999. Sophocles: Antigone.
Hota, S., S. Argamon, M. Koppel, and I. Zigdon, 2006. “Performing Gender: Automatic Stylistic Analysis of Shakespeare’s Characters.” Digital Humanities Abstracts 2006. Hota, S., S. Argamon, R. Chung 2007. “Understanding the Linguistic Construction of Gender in Shakespeare via Text Mining.” Digital Humanities Abstract 2007. Koppel, M., S. Argamon, A. Shimoni 2002. “Automatically Categorizing Written Texts by Author Gender”, Literary and Linguistic Computing 17:4 (2002): 401-12. McClure, L., 1999. Spoken Like a Woman: Speech and Gender in Athenian Drama. Yu, B. 2007. An Evaluation of Text-Classification Methods for Literary Study. Diss. UIUC. In conclusion, we hope to demonstrate that data mining Greek drama brings new insights, despite the small size of the corpus and the intense scrutiny that it has already seen over the centuries. A quantitative study of this sort has value in its own right, but can also be a springboard for close readings of individual passages and form the foundation for a fuller linguistic and literary analysis. _____________________________________________________________________________ 108 Digital Humanities 2008 _____________________________________________________________________________ Information visualization and text mining: application to a corpus on posthumanism Ollivier Dyens odyens@alcor.concordia.ca Université Concordia, Canada Dominic Forest dominic.forest@umontreal.ca Université de Montréal, Canada Patric Mondou patric.mondou@uqam.ca Université du Quebec à Montréal, Canada Valérie Cools Université Concordia, Canada David Johnston Université Concordia, Canada The world we live in generates so much data that the very structure of our societies has been transformed as a result. The way we produce, manage and treat information is changing; and the way we present this information and make it available raises numerous questions. Is it possible to give information an intuitive form, one that will enable understanding and memory? Can we give data a form better suited to human cognition? How can information display enhance the utility, interest and necessary features of data, while aiding memories to take hold of it? This article will explore the relevance of extracting and visualizing data within a corpus of text documents revolving around the theme of posthumanism. When a large quantity of data must be represented, visualizing the information implies a process of synthesis, which can be assisted by techniques of automated text document analysis. Analysis of the text documents is followed by a process of visualization and abstraction, which allows a global visual metaphoric representation of information. In this article, we will discuss techniques of data mining and the process involved in creating a prototype information-cartography software; and suggest how information-cartography allows a more intuitive exploration of the main themes of a textual corpus and contributes to information visualization enhancement. Introduction Human civilization is now confronted with the problem of information overload. How can we absorb an ever-increasing quantity of information? How can we both comprehend it and filter it according to our cognitive tools (which have adapted to understand general forms and structures but struggle with sequential and cumulative data)? How can we mine the riches that are buried in information? How can we extract the useful, necessary, and essential pieces of information from huge collections of data? 
Most importantly, how can we guarantee the literal survival of memory? For why should we remember when an uncountable number of machines remember for us? And how can we remember when machines won’t allow us to forget? Human memory has endured through time precisely because it is unaware of the greater part of the signals the brain receives (our senses gather over 11 million pieces of information per second, of which the brain is only aware of a maximum of 40 (Philipps, 2006, p.32)). Memory exists because it can forget. Is an electronic archive a memory? We must rethink the way that we produce and handle information. Can we give it a more intuitive form that would better lend itself to retention and understanding? Can we better adapt it to human cognition? How can we extrapolate narratives from databases? How can we insert our memories into this machine memory? The answer is simple: by visualizing it. For the past few years, the technical aspect of information representation and visualization has been the subject of active research which is gaining more and more attention (Card et al., 1999; Chen, 1999; Don et al., 2007; Geroimenko and Chen, 2005; Perer and Shneiderman, 2006; Spence, 2007). Despite the richness of the work that has been done, there is still a glaring lack of projects related to textual analyses, specifically of literary or theoretical texts, which have successfully integrated advances in the field of information visualization. Objectives Our project is part of this visual analytical effort. How can we visually represent posthumanism? How can we produce an image of its questions and challenges? How can we transform the enormous quantity of associated information the concept carries into something intuitive? We believe that the creation of a thematic map is part of the solution. Why a map? Because the information-driving suffocation we experience is also our inability to create cognitive maps (as Fredric Jameson suggested). And so we challenged ourselves to create a map of posthumanism, one that could be read and understood intuitively. To do this, we have chosen to use Google Earth (GE) as the basic interface for the project. GE’s interface is naturally intuitive. It corresponds to our collective imagination and it also allows users to choose the density of information they desire. GE allows the user to “dive” into the planet’s geographical, political, and social layers, or to stay on higher, more general levels. GE “tells” the story of our way of seeing the world. GE shows us the world the way science describes it and political maps draw it. This leads us to confront three primary challenges: the first was to find an area to represent posthumanism. The second was to give this area a visual form. The third was to integrate, in an intuitive way, the significant number of data and texts on the subject. _____________________________________________________________________________ 109 Digital Humanities 2008 _____________________________________________________________________________ Methodology In order to successfully create this thematic map, we first compiled a significant number of texts about posthumanism. The methodology used to treat the corpus was inspired by previous work in the field of text mining (Forest, 2006; IbekweSanjuan, 2007; Weiss et al., 2005). The goal of the analysis is to facilitate the extraction and organization of thematic groups. 
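One straightforward way to hand such a thematic map to Google Earth is to write a KML file with one placemark per theme, as in the Python sketch below; the theme names and coordinates are invented placeholders, and the real prototype presumably derives them from the clustering described in the methodology that follows.

from xml.sax.saxutils import escape

themes = {"machine memory": (-140.0, -75.0), "artificial life": (-120.0, -78.0)}   # invented

placemarks = "".join(
    "<Placemark><name>%s</name><Point><coordinates>%.4f,%.4f,0</coordinates></Point></Placemark>"
    % (escape(name), lon, lat)
    for name, (lon, lat) in themes.items()
)
kml = ('<?xml version="1.0" encoding="UTF-8"?>'
       '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>%s</Document></kml>' % placemarks)

with open("posthumanism-themes.kml", "w", encoding="utf-8") as f:
    f.write(kml)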
Non-supervised clustering technique is used in order to produce a synthetic view of the thematic area of the subject being treated. The data is processed in four main steps: 1. Segmentation of the documents, lexical extraction and filtering 2.Text transformation using vector space model 3. Segment classification 4. Information visualization Having established and sorted a sizeable informationpopulation, a new continent was needed. The continental outline of Antarctica was adopted. Antarctica was chosen because its contours are relatively unknown and generally unrecognizable; it tends to be thought of more as a white sliver at the bottom of the world map than a real place. Politically neutral Antarctica, whose shape is curiously similar to that of the human brain, is a large area surrounded by oceans. These qualities made it symbolically ideal for utilization in a map of posthumanism, a New World of the 21st century. Figure 1. A general level of the visualized thematic map The system offers users a certain number of areas, based on the algorithms used to process the data.These areas represent themes. Clicking on the brain icons allows the user to read an excerpt from one of the texts that is part of the cluster. Antarctica also allowed us to avoid information overload typical of a populated area: it carries minimal associative memories and historic bias; few people have lived there. To differentiate our new continent, the contour of Antarctica was moved into the mid Pacific.The continent was then reskinned visually with terrain suggestive of neurology, cybernetics, and symbiosis.The end result was a new land mass free of memories ready for an abstract population. Visualizing the results After identifying the thematic structure of the corpus, themes are converted into regions. Themes are assigned portions of the continent proportional to their occurrence. Inside each major region are 2 added levels of analysis: sub-regions and subsub-regions, each represented by several keywords. Keywords correspond to clusters discovered during the automated text analysis. So it is possible to burrow physically downward into the subject with greater and greater accuracy; or to rest at an attitude above the subject and consider the overall ‘terrain’. Each area is associated with a category of documents. From each area it is possible to consult the documents associated with each region. Figure2. A specific level of the thematic map When the user zooms in on a region, the application shows the next level in the hierarchy of data visualization.Within one theme (as shown below) several sub-themes appear (in red). A greater number of excerpts is available for the same area. _____________________________________________________________________________ 110 Digital Humanities 2008 _____________________________________________________________________________ disinvestment as the data eventually cancel each other out. We think that giving these data an intuitive form will make their meaning more understandable and provide for their penetration into the collective consciousness. Posthumanism seems particularly well adapted to pioneer this system because it questions the very definition of what it is to be human. References Alpaydin, E. (2004). Introduction to machine learning. Cambridge (Mass.): MIT Press. Aubert, N. (2004). “Que sommes-nous devenus?” Sciences humaines, dossier l’individu hypermoderne, no 154, novembre 2004, pp. 36-41. Card, S. K., Mackinlay, J. and Shneiderman, B. (1999). 
Readings in information visualization: using vision to think. Morgan Kaufmann. Figure 3. Looking at a document from the thematic map The icons indicating the availability of texts may be clicked on at any time, allowing the user to read the excerpt in a small pop-up window, which includes a link to the whole article.This pop-up window can serve to show pertinent images or other hyperlinks. Chen, C. (1999). Information visualisation and virtual environments. New York: Springer-Verlag. Clement, T., Auvil, L., Plaisant, C., Pape, G. et Goren,V. (2007). “Something that is interesting is interesting them: using text mining and visualizations to aid interpreting repetition in Gertrude Steins The Making of Americans.” Digital Humanities 2007. University of Illinois, Urbana-Champaign, June 2007. Don, A., Zheleva, E., Gregory, M., Tarkan, S., Auvil, L., Clement, T., Shneiderman, B. and Plaisant, C. (2007). Discovering interesting usage patterns in text collections: integrating text mining with visualization. Rapport technique HCIL-2007-08, Human-Computer Interaction Lab, University of Maryland. Garreau, J. (2005). Radical Evolution. New York: Doubleday. Geroimenko,V. and Chen, C. (2005). Visualizing the semantic web: XML-based Internet and information visualization. New York: Springer-Verlag. Figure 4. 3-dimensional rendering of the thematic map At any time, the user can rotate the point of view or change its vertical orientation.This puts the camera closer to the ground and allows the user to see a three dimensional landscape. Conclusion We see the visualization of data, textual or otherwise, as part of a fundamental challenge: how to transform information into knowledge and understanding. It is apparent that the significant amount of data produced by research in both science and the humanities is often much too great for any one individual. This overload of information sometimes leads to social Forest, D. (2006). Application de techniques de forage de textes de nature prédictive et exploratoire à des fins de gestion et d’analyse thématique de documents textuels non structurés. Thèse de doctorat, Montréal, Université du Québec à Montréal. Frenay, R. (2006). Pulse,The coming of age of systems, machines inspired by living things. New York: Farrar, Strauss and Giroux. Harraway, D. (1991). Simians, cyborgs, and women.The reinvention of nature. New York: Routledge. Ibekwe-SanHuan, F. (2007). Fouille de textes. Méthodes, outils et applications. Paris: Hermès. Jameson, F. (1991). Postmodernism, or, the cultural logic of late capitalism.Verso. _____________________________________________________________________________ 111 Digital Humanities 2008 _____________________________________________________________________________ Joy, B. (2000). “Why the future doesn’t need us.” Wired. No 8.04, April 2000. Kelly, K. (1994). Out of control.The rise of neo-biological civilization. Readings (Mass.): Addison-Wesley Publishing Company. Kurzweil, R. (2005). The singularity is near: when humans transcend biology.Viking. Levy, S. (1992). Artificial life. New York:Vintage Books, Random House. Manning, C. D. and H. Schütze. (1999). Foundations of statistical natural language processing. Cambridge (Mass.): MIT Press. Perer, A. and Shneiderman, B. (2006). “Balancing systematic and flexible exploration of social networks.” IEEE Transactions on Visualization and Computer Graphics.Vol. 12, no 5, pp. 693700. Philips, H. (2006). “Everyday fairytales.” New Scientist, 7-13 October 2006. 
Plaisant, C., Rose, J., Auvil, L.,Yu, B., Kirschenbaum, M., Nell Smith, M., Clement, T. and Lord, G. (2006). “Exploring erotics in Emily Dickinson’s correspondence with text mining and visual interfaces.” Joint conference on digital libraries 2006. Chapel Hill, NC, June 2006. Ruecker, S. 2006. “Experimental interfaces involving visual grouping during browsing.” Partnership: the Canadian journal of library and information practice and research.Vol. 1, no 1, pp. 1-14. Salton, G. (1989). Automatic Text Processing. Reading (Mass.): Addison-Wesley. Spence, R. (2007). Information visualization: design for interaction. Prentice Hall. Tattersall, I. (2006). “How we came to be human.” Scientific American, special edition: becoming human.Vol 16, no 2, pp. 66-73. Unsworth, J. (2005). New methods for humanities research. The 2005 Lyman award lecture, National Humanities Center, 11 November 2005. Unsworth, J. (2004). “Forms of attention: digital humanities beyond representation.” The face of text: computer-assisted text analysis in the humanities.The third conference of the Canadian symposium on text analysis (CaSTA). McMaster University, 1921 November 2004. Weiss, S. M., Indurkhya, N., Zhang, T. and Damereau, F. J. (2005). Text mining. Predictive methods for analyzing unstructured information. Berlin; New York: Springer-Verlag. How Rhythmical is Hexameter: A Statistical Approach to Ancient Epic Poetry Maciej Eder maciejeder@poczta.ijp-pan.krakow.pl Polish Academy of Sciences , Poland In this paper, I argue that the later a given specimen of hexamater is, the less rhythmical it tends to be. A brief discussion of the background of ancient Greek and Latin metrics and its connections to orality is followed by an account of spectral density analysis as my chosen method. I then go on to comment on the experimental data obtained by representing several samples of ancient poetry as coded sequences of binary values. In the last section, I suggest how spectral density analysis may help to account for other features of ancient meter. The ancient epic poems, especially the most archaic Greek poetry attributed to Homer, are usually referred as an extraordinary fact in the history of European literature. For present-day readers educated in a culture of writing, it seems unbelievable that such a large body of poetry should have been composed in a culture based on oral transmission. In fact, despite of genuine singers’ and audience’s memory, epic poems did not emerge at once as fixed texts, but they were re-composed in each performance (Lord 2000, Foley 1993, Nagy 1996). The surviving ancient epic poetry displays some features that reflect archaic techniques of oral composition, formulaic structure being probably the most characteristic (Parry 1971: 37-117). Since formulaic diction prefers some fixed rhythmical patterns (Parry 1971: 8-21), we can ask some questions about the role of both versification and rhythm in oral composition.Why was all of ancient epic poetry, both Greek and Latin, composed in one particular type of meter called hexameter? Does the choice of meter influence the rhythmicity of a text? Why does hexameter, in spite of its relatively restricted possibilities of shaping rhythm, differ so much from one writer to another some (cf. Duckworth 1969: 37-87)? And, last but not least, what possible reasons are there for those wide differences between particular authors? It is commonly known that poetry is in general easier to memorize than prose, because rhythm itself tends to facilitate memorization. 
In a culture without writing, memorization is crucial, and much depends on the quality of oral transmission. In epic poems from an oral culture rhythm is thus likely to be particularly important for both singers and hearers, even though they need not consciously perceive poetic texts as rhythmical to benefit from rhythm as an aid to memory. _____________________________________________________________________________ 112 Digital Humanities 2008 _____________________________________________________________________________ It may then be expected on theoretical grounds that non-oral poems, such as the Latin epic poetry or the Greek hexameter of the Alexandrian age, will be largely non-rhythmical, or at least display weaker rhythm effects than the archaic poems of Homer and Hesiod. Although formulaic diction and other techniques of oral composition are noticeable mostly in Homer’s epics (Parry 1971, Lord 2000, Foley 1993, etc.), the later hexameters, both Greek and Latin, also display some features of oral diction (Parry 1971: 24-36). The metrical structure of hexameter might be quite similar: strongly rhythmical in the oldest (or rather, the most archaic) epic poems, and less conspicuous in poems composed in written form a few centuries after Homer. The aim of the present study is to test the hypothesis that the later a given specimen of hexameter is, the less rhythmical it tends to be. Because of its nature versification easily lends itself to statistical analysis. A great deal of work has already been done in this field, including studies of Greek and Latin hexameter (Jones & Gray 1972, Duckworth 1969, Foley 1993, etc.). However, the main disadvantage of the methods applied in existing research is that they describe a given meter as if it were a set of independent elements, which is actually not true. In each versification system, the specific sequence of elements plays a far more important role in establishing a particular type of rhythm than the relations between those elements regardless their linear order (language “in the mass” vs. language “in the line”; cf. Pawlowski 1999). Fortunately, there are a few methods of statistical analysis (both numeric and probabilistic) that study verse by means of an ordered sequence of elements. These methods include, for example, time series modeling, Fourier analysis, the theory of Markov chains and Shannon’s theory of information. In the present study, spectral density analysis was used (Gottman 1999, Priestley 1981, etc.). Spectral analysis seems to be a very suitable tool because it provides a cross-section of a given time series: it allows us to detect waves, regularities and cycles which are not otherwise manifest and open to inspection. In the case of a coded poetry sample, the spectrogram shows not only simple repetitions of metrical patterns, but also some subtle rhythmical relations, if any, between distant lines or stanzas. On a given spectrogram, a distinguishable peak indicates the existence of a rhythmical wave; numerous peaks suggest a quite complicated rhythm, while a pure noise (no peaks) on the spectrogram reflects a non-rhythmical data. ancient verse was purely quantitative or whether it also had some prosodic features (Pawlowski & Eder 2001), the quantity-based nature of Greek and Roman meter was never questioned. It is probable that rhythm was generated not only by quantity (especially in live performances), but it is certain that quantity itself played an essential role in ancient meter. 
Thus, in the coding procedure, all prosodic features were left out except the quantity of syllables (cf. Jones & Gray 1972, Duckworth 1969, Foley 1993, etc.). A binary-coded series was then obtained for each sample, e.g., book 22 of the Iliad begins as a series of values: 1110010010010011100100100100111001110010010011... The coded samples were analyzed by means of the spectral density function. As might be expected, on each spectrogram there appeared a few peaks indicating the existence of several rhythmical waves in the data. However, while the peaks suggesting the existence of 2- and 3-syllable patterns it the text were very similar for all the spectrograms and quite obvious, the other peaks showed some large differences between the samples. Perhaps the most surprising was the peak echoing the wave with a 16-syllable period, which could be found in the samples of early Greek poems by Homer, Hesiod, Apollonius, and Aratos (cf. Fig. 1). The same peak was far less noticeable in the late Greek hexameter of Nonnos, and almost absent in the samples of Latin writers (cf. Fig. 2). Other differences between the spectrograms have corroborated the observation: the rhythmical effects of the late poems were, in general, weaker as compared with the rich rhythmical structure of the earliest, orally composed epic poems. Although the main hypothesis has been verified, the results also showed some peculiarities. For example, the archaic poems by Homer and Hesiod did not differ significantly from the poems of the Alexandrian age (Apollonius, Aratos), which was rather unexpected. Again, the rhythm of the Latin hexameter turned out to have a different underlying structure than that of all the Greek samples.There are some possible explanations of those facts, such as that the weaker rhythm of the Latin samples may relate to inherent differences between Latin and Greek. More research, both in statistics and in philology, is needed, however, to make such explanations more nuanced and more persuasive. To verify the hypothesis of hexameter’s decreasing rhythmicity, 7 samples of Greek and 3 samples of Latin epic poetry were chosen. The specific selection of sample material was as follows: 3 samples from Homeric hexameter (books 18 and 22 from the Iliad, book 3 from the Odyssey), 1 sample from Hesiod (Theogony), Apollonius (Argonautica, book 1), Aratos (Phainomena), Nonnos (Dionysiaca, book 1), Vergil (Aeneid, book 3), Horace (Ars poetica), and Ovid (Metamorphoses, book 1). In each sample, the first 500 lines were coded in such a way that each long syllable was assigned value 1, and each short syllable value 0. Though it is disputed whether _____________________________________________________________________________ 113 Digital Humanities 2008 _____________________________________________________________________________ Pawlowski, Adam, and Maciej Eder. Quantity or Stress? Sequential Analysis of Latin Prosody. Journal of Quantitative Linguistics 8.1 (2001): 81-97. Priestley, M. B. Spectral Analysis and Time Series. London: Academic Press, 1981. Bibliography Duckworth, George E. Vergil and Classical Hexameter Poetry: A Study in Metrical Variety. Ann Arbor: University of Michigan Press, 1969. Foley, John Miles. Traditional Oral Epic: “The Odyssey”, “Beowulf ” and the Serbo-Croatian Return Song. Berkeley: University of California Press, 1993. Gottman, John Mordechai, and Anup Kumar Roy. Sequential Analysis: A Guide for Behavioral Researchers. Cambridge: Cambridge University Press, 1990. 
Jones, Frank Pierce, and Florence E. Gray. Hexameter Patterns, Statistical Inference, and the Homeric Question: An Analysis of the La Roche Data. Transactions and Proceedings of the American Philological Association 103 (1972): 187-209.

Lord, Albert B. The Singer of Tales. Ed. Stephen Mitchell and Gregory Nagy. Cambridge, MA: Harvard University Press, 2000.

Nagy, Gregory. Poetry as Performance: Homer and Beyond. Cambridge: Cambridge University Press, 1996.

Parry, Milman. The Making of Homeric Verse: The Collected Papers of Milman Parry. Ed. Adam Parry. Oxford: Clarendon Press, 1971.

Pawlowski, Adam. Language in the Line vs. Language in the Mass: On the Efficiency of Sequential Modeling in the Analysis of Rhythm. Journal of Quantitative Linguistics 6.1 (1999): 70-77.

Pawlowski, Adam, and Maciej Eder. Quantity or Stress? Sequential Analysis of Latin Prosody. Journal of Quantitative Linguistics 8.1 (2001): 81-97.

Priestley, M. B. Spectral Analysis and Time Series. London: Academic Press, 1981.

TEI and cultural heritage ontologies

Øyvind Eide
oeide@edd.uio.no
University of Oslo, Norway

Christian-Emil Ore
c.e.s.ore@edd.uio.no
University of Oslo, Norway

Introduction

Since the mid 1990s there has been an increase in the interest in the design and use of conceptual models (ontologies) in humanities computing and library science, as well as in knowledge engineering in general. There is also a wish to use such models to enable information interchange. TEI has in its 20 years of history concentrated on the markup of functional aspects of texts and their parts. That is, a person name is marked, but linking to information about the real-world person denoted by that name was not in the main scope. The scope of TEI has gradually broadened, however, to include more real-world information external to the text in question. The Master project (Master 2001) is an early example of this change. In TEI P5 a series of new elements for marking up real-world information are introduced and several such elements from P4 are adjusted. TEI P5 is meant to be a set of guidelines for the encoding of a large variety of texts in many cultural contexts. Thus the set of real-world oriented elements in TEI P5 should not formally be bound to a single ontology. The ontological part of TEI P5 is, however, closely connected to the authors’ implicit world view. Thus we believe it is important to study this part of TEI P5 with some well-defined ontology as a yardstick. Our long experience with the memory institution sector makes CIDOC CRM (Conceptual Reference Model) a natural choice. CIDOC CRM (Crofts 2005) has proven useful as an intellectual aid in formulating the intended scope of the elements in new markup schemes, and we believe the model can be useful to clarify the ontological part of TEI.
This will also clarify what is needed in order to harmonize it with major standards like CIDOC CRM, FRBR, EAD and CDWA Lite.

CIDOC CRM

CIDOC CRM is a formal ontology intended to facilitate the integration, mediation and interchange of heterogeneous cultural heritage information. It was developed by interdisciplinary teams of experts, coming from fields such as computer science, archaeology, museum documentation, history of art, natural history, library science, physics and philosophy, under the aegis of the International Committee for Documentation (CIDOC) of the International Council of Museums (ICOM). The harmonisation of CIDOC CRM and IFLA’s FRBR (FRBR 1998) is in the process of being completed. The EAD has already been mapped to CIDOC CRM (Theodoridou 2001).

CIDOC CRM is defined in an object-oriented formalism which allows for a compact definition with abstraction and generalisation. The model is event-centric, that is, actors, places and objects are connected via events. CIDOC CRM is a core ontology in the sense that the model does not have classes for all particulars, unlike, for example, the Art and Architecture Thesaurus with its thousands of concepts. CIDOC CRM has little more than 80 classes and 130 properties. The most central classes and properties for data interchange are shown below.

Example: The issue of a medieval charter can be modelled as an activity connecting the issuer, witnesses, scribe and place and time of issue. The content of the charter is modelled as a conceptual object and the parchment as a physical thing. In cases where it is necessary for a scholarly analysis and when sufficient information has been preserved, the issuing of a charter can be broken down into a series of smaller events, e.g., the actual declaration in a court, the writing of the parchment and the attachment of the seals. This conceptual analysis can be used as an intellectual aid in the formulation of a data model and its implementation.

In 2005 the CRM was reformulated as a simple XML DTD, called CRM-Core, to enable CRM-compliant markup of multimedia metadata (Sinclair 2006). A CRM-Core XML package may contain information about a single instance of any class in the CRM and how it may be connected to other objects via events and properties. The German Council of Museums has based its standard for XML-based museum data interchange, MUSEUMDAT, on a combination of the Getty standard CDWA Lite and CRM Core. The CDWA Lite revision group is currently considering these changes to CDWA Lite in order to make it compatible with the CRM.

TEI P5 ontology elements in the light of CIDOC CRM

In TEI P5 the new ontologically oriented elements are introduced in the module NamesPlaces, described in chapter 13, Names, Dates, People, and Places. There are additional elements described in chapter 10, Manuscript Description, in the TEI header and in connection with bibliographic descriptions as well. In this paper we concentrate on the elements in chapter 13. The central elements in this module are person, personGrp, org, place and event. Person, personGrp and org are “elements which provide information about people and their relationships”. CIDOC CRM has corresponding classes with a common superclass E39 Actor. The element event is defined as containing “data relating to any kind of significant event associated with a person, place, or organization” and is similar to the CIDOC CRM class E5 Event and its subclasses. In the discussion of the marriage example in chapter 13, the event element is presented as a “freestanding” element. In the formal definition it is limited to person and org. To make this coherent, the formal part will have to be extended or the example will have to be changed. Still, event is problematic. The marriage example demonstrates that it is impossible to express the role a person has in an event. Without knowing the English marriage conventions one does not know whether the “best man” participated. The very generic element persRel introduced in P5 does not solve this problem.
A possible solution to this problem would be to introduce an EventStateLike model class with elements for roles and participants.

The model classes orgStateLike, personStateLike, personTraitLike, placeStateLike and placeTraitLike group elements used to mark up characteristics of persons, organisations and places. The elements in the ...TraitLike model classes contain information about permanent characteristics, and the elements in the ...StateLike classes information about more temporal characteristics. The model classes contain the generic trait and state elements in addition to specialised elements. The intention is to link all characteristics relating to a person, organisation or place. It is not possible to make a single mapping from these classes into CIDOC CRM. It will depend partly on which type of trait or state is used, and partly on the way in which it is used. Many characteristics will correspond to persistent items like E55 Type, E62 String and E41 Appellation, and are connected to actors and places through the properties P1 is identified by, P2 has type and P3 has note. Other elements like floruit, which is used to describe a person’s active period, are temporal states corresponding to the CIDOC CRM temporal entity E3 Condition State. From an ontological point of view the two elements state and trait can be considered a generic mechanism for typed linking between the major classes.

All the elements in the ...TraitLike and ...StateLike model classes can be supplied with the attributes notAfter and notBefore, defining the temporal extension of their validity. This is a very powerful mechanism for expressing, in synoptic form, information based on extensive but hidden scholarly investigation of real-world events. As long as the justification for the values in these elements is not present, however, it is hard to map this information into an event-oriented conceptual model like the CRM. Thus, it is important to include descriptions of methods for such justification in the guidelines, including examples.

TEI ontology – conclusion

The new elements in TEI P5 bring TEI a great step in the direction of an event-oriented model. Our use of CRM Core as a yardstick has shown that small extensions to and adjustments of the P5 elements will enable the expression of CRM Core packages by TEI elements. This is a major change from our previous suggestions (Ore 2006), in which the ontological module was outside TEI. To continue this research, an extended TEI tagset should be developed with elements for abstracts corresponding to the ones in FRBR and CRM. This will not change the ontological structure of TEI significantly. But these adjustments will make the ontological information in a TEI document compliant with the other cultural heritage models, for example EAD, FRBR/FRBRoo, CIDOC CRM and CDWA Lite. There is an ongoing harmonisation process between all these initiatives in which it is important that TEI takes part.

Bibliography

Crofts, N., Doerr, M., Gill, T., Stead, S. and Stiff, M. (eds.) (2005): Definition of the CIDOC Conceptual Reference Model. (June 2005). URL: http://cidoc.ics.forth.gr/docs/cidoc_crm_version_4.2.doc (checked 2007-11-15)

CDWA Lite. www.getty.edu/research/conducting_research/standards/cdwa/cdwalite.html (checked 2007-11-25)

FRBR (1998). Functional Requirements for Bibliographic Records. Final Report. International Federation of Library Associations. URL: http://www.ifla.org/VII/s13/frbr/frbr.pdf (checked 2007-11-24)
MASTER (2001). “Manuscript Access through Standards for Electronic Records (MASTER).” Cover Pages: Technology Reports. URL: http://xml.coverpages.org/master.html (checked 2007-11-25)

MUSEUMDAT. www.museumdat.org/ (checked 2007-11-25)

Ore, Christian-Emil and Øyvind Eide (2006). “TEI, CIDOC-CRM and a Possible Interface between the Two.” Pp. 62-65 in Digital Humanities 2006. Conference Abstracts. Paris, 2006.

Sinclair, Patrick, et al. (2006). “The use of CRM Core in Multimedia Annotation.” Proceedings of the First International Workshop on Semantic Web Annotations for Multimedia (SWAMM 2006). URL: http://cidoc.ics.forth.gr/docs/paper16.pdf (checked 2007-11-25)

TEI P5 (2007). Guidelines for Electronic Text Encoding and Interchange. URL: http://www.tei-c.org/Guidelines/P5/ (checked 2007-11-15)

Theodoridou, Maria and Martin Doerr (2001). Mapping of the Encoded Archival Description DTD Element Set to the CIDOC CRM. Technical Report FORTH-ICS/TR-289. URL: http://cidoc.ics.forth.gr/docs/ead.pdf (checked 2007-11-25)

Homebodies and GadAbouts: A Chronological Stylistic Study of 19th Century French and English Novelists

Joel Goldfield
joel@cs.fairfield.edu
Fairfield University, USA

David L. Hoover
david.hoover@nyu.edu
New York University, USA

The assumption that authorial style changes over time is a reasonable one that is widely accepted in authorship attribution. Some important studies concentrate on individual authors, using a wide array of methods and yielding varied results, however, so that it is unwise to generalize (see the overview in Stamou 2007). Henry James’s style, for example, changes in a remarkably consistent and extreme way (Hoover 2007), and Austen’s stylistic development also seems consistent (Burrows 1987), but Beckett’s texts are individually innovative without showing consistent trends (Stamou 2007: 3). For some authors, different investigators have obtained inconsistent results. We know of no general study that investigates how authorial style changes over a long career. We thus explore how working and otherwise living abroad for periods exceeding two years may affect an author’s vocabulary and style. Our presentation begins a broadly based study of the growth and decay of authorial vocabulary over time. Although we limit our study to nineteenth-century prose, to reduce the number of possible variables to a manageable level, we compare French and English authors, allowing us to test whether any discovered regularities are cross-linguistic. Because authors’ vocabularies seem related to their life experiences, we compare authors who spent most of their lives in a single geographical area with others who traveled and lived extensively abroad. For the purposes of this study, we define extensive living and working abroad as at least two consecutive years, somewhat longer than the contemporary American student’s or faculty member’s stay of a semester or a year. This differentiation allows us to investigate in what significant ways, if any, extensive foreign travel affects an author’s vocabulary. Our investigation of these questions requires a careful selection of authors and texts.
Although research is still ongoing, we have selected the following eight authors–two in each of the four possible categories–for preliminary testing and further biographical study:

Domestic authors: English

• George Meredith (1828-1909). German schooling at age fifteen, but apparently no significant travel later; fifteen novels published between 1855 and 1895 available in digital form.

• Wilkie Collins (1824-1889). No foreign travel mentioned in brief biographies; 20 novels available over 38 years.

French

• Honoré de Balzac (1799-1850). He traveled little outside of France until 1843. Subsequent excursions abroad in the last seven years of his life, mainly for romance, were relatively infrequent and never exceeded five consecutive months. Over two dozen novels in digitized format are available.

• Jules Barbey d’Aurevilly (1808-1889). Raised in France and schooled in the law, he was devoted to his native Normandy and seldom ventured abroad. Initially cultivating the image of the dandy, his novels and novellas create a curious paradox in his later writing between sexually suggestive themes and nostalgia for earlier aesthetics and a defense of Catholicism. His literary productivity can be divided into the three periods of 1831-1844 (first published novel and novella), 1845-1873 (return to Catholicism; work as both literary critic and novelist), and 1874-1889. At least fourteen novels or novellas are available in digitized format.

Traveling authors: English

• Dickens (1812-1870). Seven novels available 1836-43; foreign travel for three years, 1844-47, and more travel in 1853-55; four novels after 1855 available.

• Trollope (1815-1883). Travel to Brussels in 1834, but briefly; six novels available before 1859; postal missions to Egypt, Scotland, and the West Indies, 1858-59; 1871 trip to Australia, New Zealand, and the US; travel to Ceylon and Australia, 1875, South Africa 1877, Iceland 1878; five novels available after 1878.

French

• Arthur de Gobineau (1816-1882). Raised partly in France, partly in Germany and Switzerland, he learned German and began the study of Persian in school while his mother, accused of fraud and estranged from her military husband, kept the family a step ahead of the French police. Three periods encompassing his publication of fiction, and often related to career travel, can be divided as follows: 1843-1852 (at least 4 novellas and 1 novel); 1853-1863 (1 novella); 1864-1879 (10 novellas, 2 novels). Living mainly in France from 1831-1849, he was a protégé of Alexis de Tocqueville, who brought him into the French diplomatic service in 1849. Gobineau was then stationed in Switzerland and Germany (1849-1854), Persia (1855-1858, 1861-1863), Newfoundland (1859), Greece (1864-1868), Brazil (1868-1870) and Sweden (1872-1873). Following travel through Europe in 1875-77, he left the diplomatic service. His first short fiction was published and serialized in 1846. At least a dozen novellas written throughout his career (mostly written and published as collections) and two novels (1852 and 1874) are available in digitized format.

• Victor Hugo (1802-85). Raised in France except for a six-month period in a religious boarding school in Madrid (1811-12), Hugo began writing his first novel, Bug-Jargal (1826), in 1820. This initial literary period includes 1820-1851. Aside from a few short trips lasting less than three months, Hugo lived and wrote mainly in his homeland until his exile on the Island of Guernsey during the reign of Louis-Napoléon from 1851-1870. The third period encompasses his triumphant return to Paris in 1871 until his death, celebrated by a state funeral, in 1885.

Research on the French authors is being facilitated by use of PhiloLogic and the ARTFL database, complemented by local digitized works and tools that are also used to investigate the English authors.
Preliminary testing must be done on all the authors to discover overall trends or principles of vocabulary development before investigating any possible effects of foreign travel. The importance of this can be seen in figures 1 and 2 below, two cluster analyses of English traveling authors, based on the 800 mfw (most frequent words). Dickens’s novels form two distinct groups, 1836-43 and 1848-70, a division coinciding with his 1844-47 travels. For Trollope, the match is not quite so neat, but it is suggestive. Ignoring Nina Balatka, a short romance set in Prague, and La Vendée, Trollope’s only historical novel, set in France in the 1790s, only two remaining novels in the early group were published after his travel to Egypt, Scotland, and the West Indies in 1858-59.

Fig. 1 – Fifteen Dickens Novels, 1836-1870

Fig. 2 – Thirty Trollope Novels, 1848-1883

Unfortunately, comparing cluster analyses of the English domestic authors casts doubt on any simple correlation between travel and major stylistic change. Meredith, like Dickens and Trollope, shows a sharp break between his two early novels, 1855-57, and the rest of his novels (see Fig. 3). And, ignoring Antonina (an historical novel about the fall of Rome), Collins’s novels also form early and late groups (see Fig. 4).

Fig. 3 – Fifteen Meredith Novels, 1855-1895

Fig. 4 – Twenty Collins Novels, 1850-1888

Furthermore, although our results for French authors are more preliminary, a test of six works by Gobineau shows that two of his earliest texts, written before any extensive travel, are quite similar to some of his latest texts (see Fig. 5).

Fig. 5 – Six Texts by Gobineau

We are still developing the techniques for the study of the effects of travel, but preliminary testing based on a division of each author’s career into early, middle, and late periods allows us to check for consistent trends rather than simple differences between early and late texts, and to begin comparing the four categories of authors. Choosing novelists with long careers allows us to separate the three periods, selecting natural gaps in publication where possible, but creating such gaps where necessary by omitting some texts from the study. For traveling authors, these divisions also take into account the timing of their travel, creating gaps that include the travel. Three stylistic periods create six patterns of highest, middle, and lowest frequency for each word in an author’s texts. Depending on the number and size of the novels, we include approximately the 8,000 to 10,000 most frequent words, all those frequent enough to show a clear increase or decrease in frequency; we delete words that appear in only one period. Thus, as shown in Fig. 6, each of the six patterns would be expected to occur about one-sixth (17%) of the time by chance.

Fig. 6 – Six Patterns of Change (“E” = early; “M” = middle; “L” = late)
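The six-pattern classification just described can be made concrete with a short sketch. This is not the authors’ code; it is a minimal illustration, assuming a hypothetical table of per-period relative frequencies, of how each word can be assigned one of the six orderings of early, middle, and late frequency and how the share of each pattern can be compared against the 17% chance baseline.

```python
from collections import Counter
from itertools import permutations

def pattern(early: float, middle: float, late: float) -> str:
    """Return one of the six orderings, e.g. 'E>M>L', for a word's
    relative frequencies in the early, middle and late periods."""
    ranked = sorted([("E", early), ("M", middle), ("L", late)],
                    key=lambda kv: kv[1], reverse=True)
    return ">".join(label for label, _ in ranked)

def pattern_counts(freqs: dict[str, tuple[float, float, float]]) -> Counter:
    """freqs maps word -> (early, middle, late) relative frequencies.
    Words attested in only one period are dropped, as in the study."""
    counts = Counter()
    for word, (e, m, l) in freqs.items():
        if sum(1 for f in (e, m, l) if f > 0) < 2:
            continue
        counts[pattern(e, m, l)] += 1
    return counts

# Hypothetical toy data: per-million frequencies in three periods.
toy = {"upon": (310.0, 280.0, 240.0),   # E>M>L: gradual decrease
       "really": (20.0, 45.0, 60.0),    # L>M>E: gradual increase
       "chaise": (12.0, 0.0, 0.0)}      # attested in one period only: dropped
counts = pattern_counts(toy)
total = sum(counts.values())
for p in [">".join(q) for q in permutations("EML")]:
    print(f"{p}: {counts[p]} word(s), {counts[p]/total:.0%} (chance ≈ 17%)")
```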
Results for the English authors, shown in Fig. 7, are both surprising and suggestive (note that the axis crosses at the “expected” 16.7% level, so that it is easy to see which patterns are more or less frequent than expected for each author). Gradual decrease in frequency, E > M > L, is the only pattern more frequent than expected for all four authors (Meredith’s figure is only very slightly more frequent than expected), and both M > E > L and L > E > M are less frequent than expected for all four authors. Although the patterns for these authors suggest some regularity in the growth and decay of vocabulary, no simple relationship emerges. Consider also the surprising fact that vocabulary richness tends to decrease chronologically for Dickens, Trollope, and possibly Collins, while only Meredith shows increasing vocabulary richness. (These comments are based on a relatively simple measure of vocabulary richness, the number of different words per 10,000 running words; for more discussion of vocabulary richness, see Tweedie and Baayen, 1998 and Hoover, 2003.) These facts contradict the intuitive assumption that the main trend in a writer’s total vocabulary should be the learning of new words. Similar comparisons are being developed for the French authors in question. The conference presentation will include not only a comparison of style within each language group, but between language groups. Such comparisons also build on strengths of corpus stylistics important to translation (Goldfield 2006) and a possible related future for comparative literature (Apter 2005).

Fig. 7 – Patterns of Frequency Change in the Words of Six Authors

References

Apter, Emily (2006). The Translation Zone: A New Comparative Literature. Princeton.

Burrows, J. F. (1992a). “Computers and the study of literature.” In Butler, C. S., ed. Computers and Written Texts. Oxford: Blackwell, 167-204.

Goldfield, J. (2006). “French-English Literary Translation Aided by Frequency Comparisons from ARTFL and Other Corpora.” DH2006: Conference Abstracts: 76-78.

Hoover, D. L. (2003). “Another Perspective on Vocabulary Richness.” Computers and the Humanities, 37(2), 151-78.

Hoover, D. L. (2007). “Corpus Stylistics, Stylometry, and the Styles of Henry James.” Style 41(2) 2007: 174-203.

Stamou, Constantina (2007). “Stylochronometry: Stylistic Development, Sequence of Composition, and Relative Dating.” LLC Advance Access published on October 1, 2007.

Tweedie, F. J. and R. H. Baayen (1998). “How Variable May a Constant Be? Measures of Lexical Richness in Perspective.” Computers and the Humanities 32 (1998), 323-352.

The Heinrich-Heine-Portal. A digital edition and research platform on the web (www.heine-portal.de)

Nathalie Groß
hhp@uni-trier.de
University of Trier, Germany

Christian Liedtke
liedtke@math.uni-duesseldorf.de
Heinrich-Heine-Institut Düsseldorf, Germany

In this paper we present the Heinrich-Heine-Portal, one of the most sophisticated web resources dedicated to a German classical author.
It virtually combines the two definitive critical editions in print (DHA = Düsseldorfer Heine-Ausgabe and HSA = Heine Säkularausgabe) together with digital images of the underlying textual originals within an elaborate electronic platform. The project, which was established in 2002, is organized as a co-operation between the Heinrich-Heine-Institut (Düsseldorf) and the Competence Centre for Electronic Publishing and Information Retrieval in the Humanities (University of Trier). It has been funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) as well as the Kunststiftung Nordrhein-Westfalen (Foundation for the Arts of North Rhine-Westphalia). The work within the project consists of two major parts. On the one hand, the printed texts have to be transferred into a digital representation which serves as the basis for the electronic publication. On the other hand, the project aims at a complete revision of important parts of the printed critical edition and provides new results from the Heinrich Heine research community.

The first part of the workflow is organized as a typical retro-digitization project. Starting with the printed editions, consisting of nearly 26,500 pages and about 72 million characters, the text was sent to a service partner in China, where it was typed in by hand. During this process two digital versions of the text were produced and then sent back to Germany, where they were automatically collated. After that, the listed differences were manually corrected by comparing them with the printed original. The result of this step is a digital text that corresponds to the printed version with an accuracy of nearly 99.995%. The next task was to transform this basic digital encoding into a platform-independent representation, which can then be used as the main data format for all following project phases. In order to achieve this goal a pool of program routines was developed which uses the typographical information and contextual conditions to generate XML markup according to the TEI guidelines. Because of the heterogeneous structures of the text (prose, lyrics, critical apparatus, tables of contents etc.) this was a very time-consuming step. At the end of this process there exists a reusable version of the data that can be archived on a long-term basis. Using the XML encoding, the digital documents were imported into an online database, where different views onto the data are stored separately, e.g. metadata for the main information about the letter corpus (lists of senders, addressees, dates and places etc.) or bibliographical information on the works. In order to provide access to the information in the database, the project has been supplied with a sophisticated graphical user interface, which has been built with the help of a content management framework (ZOPE).
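The double-keying collation step described above can be illustrated with a short sketch. This is not the project’s actual tooling; it is a minimal illustration, using Python’s standard difflib on two hypothetical transcriptions of the same printed line, of how two independently keyed versions can be collated so that only the flagged disagreements need manual proofing against the printed original.

```python
import difflib

# Two hypothetical transcriptions of the same printed line, keyed independently.
version_a = "Ich weiß nicht, was soll es bedeuten, daß ich so traurig bin;"
version_b = "Ich weiss nicht, was soll es bedeuten, dass ich so traurig bin;"

# Compare word by word and list only the disagreements for manual correction.
words_a, words_b = version_a.split(), version_b.split()
matcher = difflib.SequenceMatcher(a=words_a, b=words_b)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    if tag != "equal":
        print(f"check words {i1}-{i2}: "
              f"A reads {' '.join(words_a[i1:i2])!r}, "
              f"B reads {' '.join(words_b[j1:j2])!r}")
```

Only the positions where the two keyings differ are reported, which is what makes the claimed accuracy of nearly 99.995% achievable with a manageable amount of manual correction.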
Besides providing online access to the basic editions DHA and HSA, the Heinrich-Heine-Portal offers a completely revised and updated version of the letter corpus and displays newly discovered letters and corrigenda, which are being added constantly. Additionally, the portal is an electronic archive which contains and presents more than 12,000 digital images of original manuscripts and first editions, linked to the text and the apparatus of the edition. Most of those images were made available by the Heinrich-Heine-Institut, Düsseldorf, which holds nearly 60% of the Heine manuscripts known today. The collection of the Heine-Institut was completely digitized for this project. In addition, 32 other museums, libraries and literary archives in Europe and the United States are cooperating with the Heine-Portal and have given their consent to present manuscripts from their collections. Among them are the British Library and the Rothschild Archive in London, the Russian Archives of Social and Political History in Moscow, the Foundation of Weimar Classics and many others. One of the long-term goals of the Heine-Portal is a “virtual unification” of all of Heine’s manuscripts. Beyond that, the Heine-Portal offers extensive bibliographies of primary sources and secondary literature, from Heine’s own first publications in a small journal (1817) up to the most recent secondary literature. Other research tools of the Heine-Portal are a powerful search engine, a complex hyperlink structure which connects the texts, commentaries and the different sections of the Portal with each other, and a database of Heine’s letters with detailed information on their edition, their availability and the institutions which hold them.

Our paper will describe the technical and philological aspects of the work process that was necessary for the completion of the Heine-Portal; it will give an account of its main functions and demonstrate their use for students and scholars alike, and it will discuss possible pedagogical applications for schools as well as university teaching.

Bibliography

Bernd Füllner and Christian Liedtke: Volltext, Web und Hyperlinks. Das Heinrich-Heine-Portal und die digitale Heine-Edition. In Heine-Jahrbuch 42, 2003, pp. 178-187.

Bernd Füllner and Johannes Fournier: Das Heinrich-Heine-Portal. Ein integriertes Informationssystem. In Standards und Methoden der Volltextdigitalisierung. Beiträge des Internationalen Kolloquiums an der Universität Trier, 8./9. Oktober 2001. Hrsg. von Thomas Burch, Johannes Fournier, Kurt Gärtner u. Andrea Rapp. Trier 2002 (Abhandlungen der Akademie der Wissenschaften und der Literatur, Mainz; Geistes- und Sozialwissenschaftliche Klasse), pp. 239-263.

Bernd Füllner and Christian Liedtke: Die Datenbanken des Heinrich-Heine-Portals. Mit fünf unbekannten Briefen von und an Heine. In Heine-Jahrbuch 43, 2004, pp. 268-276.

Nathalie Groß: Das Heinrich-Heine-Portal: Ein integriertes Informationssystem. In Uni-Journal 3/2004, pp. 25-26.

Nathalie Groß: Der Digitale Heine – Ein Internetportal als integriertes Informationssystem. In Jahrbuch für Computerphilologie – online. Hg. v. Georg Braungart, Karl Eibl und Fotis Jannidis. München 2005, pp. 59-73. Available online: http://computerphilologie.uni-muenchen.de/jg04/gross/gross.html

Christian Liedtke: Die digitale Edition im Heinrich-Heine-Portal – Probleme, Prinzipien, Perspektiven. In editio. Internationales Jahrbuch für Editionswissenschaft. Tübingen 2006.

Bernd Füllner and Nathalie Groß: Das Heinrich-Heine-Portal und digitale Editionen. Bericht über die Tagung im Heinrich-Heine-Institut in Düsseldorf am 6. Oktober 2005. In Heine-Jahrbuch 45, 2006, pp. 240-248.
_____________________________________________________________________________ 121 Digital Humanities 2008 _____________________________________________________________________________ The New Middle High German Dictionary and its Predecessors as an Interlinked Compound of Lexicographical Resources Kurt Gärtner gaertnek@staff.uni-marburg.de Universität Trier, Germany The paper gives, firstly, a brief overview over the history of Middle High German (MHG) lexicography leading to the new MHG dictionary. Secondly, it describes the digitisation of the old MHG dictionaries which form a subnet in a wider net of German dictionaries. Thirdly, this network, together with specific tools and digital texts, is seen as an essential source for compiling the new MHG dictionary, of which, fourthly, the essential features, esp. its online version, are mentioned briefly, but will be demonstrated in detail in the full paper. 1. Looking back at the beginnings of Humanities Computing in the sixties, one of the major subjects was the support the computer would give to lexicographers. Many e-texts have been created then for mere lexicographical purposes (Wisbey 1988). The production of concordances and indices was at its heights between 1970 and 1980 (cf. Gärtner 1980). However, the road to real dictionary making came only gradually in sight when lemmatization was required and semantics began to dominate the discussion of computer application in lexicography. The work on the dictionaries to the medieval Germanic languages benefited a great deal from this development. This has certainly been the case regarding MHG, the period of German from ca. 1050 to ca. 1350 resp. – in the earlier periodisations – to 1500. During the 20th century, the situation of MHG lexicography had become more and more deplorable. Since Lexer’s dictionary appeared in 1878, all plans for a new dictionary had been unsuccessful. Therefore, many editors of MHG texts published after 1878 provided glossaries as a means of presenting such material that was not included in the earlier dictionaries edited by Benecke/Müller/Zarncke (BMZ) and Lexer with its Nachträge. It was not until 1985 that a group of researchers at the University of Trier started compiling the Findebuch which was published in 1992. The Findebuch was a compilation of glossaries to editions in order to make up for the lack of a new MHG dictionary. Also in the late eighties, on the basis of the work on the Findebuch, planning began to work out a new MHG dictionary (cf. Gärtner/Grubmüller 2000). This time the plans were successful, and work on the new dictionary started in 1994. The scientific use of the computer had meanwhile become a conditio sine qua non, also for the new MHG dictionary (cf. Gärtner/Plate/Recker 1999). 2. The work on a new dictionary relies to a great deal on its predecessors. As the Findebuch had been compiled with extensive use of the computer and existed in machine readable form, the need for digitizing the old MHG dictionaries was strongly felt from the beginning of the new dictionary’s planning. The old dictionaries (BMZ, Lexer with Nachträge) and the Findebuch are closely interconnected and can only be used simultaneously. 
They were ideal candidates for the composition of an electronic dictionary compound.Therefore, as a supporting research project to the new dictionary, the old ones were digitized and interlinked thus forming a digital network (MWV).The work on this project has been described to the LLC readers in detail (Fournier 2001; see also Burch/ Fournier 2001). Moreover, this network formed the starting point for a wider net that interlinks many more dictionaries to the German language and its dialects (cf.Woerterbuchnetz). The most prominent among them is the 33vols. Deutsches Wörterbuch (DWB) by the brothers Grimm. 3. The work on the new MHG dictionary (MWB) was considerably supported by two more projects. First and most essential for lexicographers working in different places was the implementation of a web based workbench for the composition of dictionaries.The development of a toolkit for a collaborative editing and publishing of the MWB has also been described to LLC readers in detail (Queens/Recker 2005).The second project supporting the work on the new dictionary was the extension of the digital text archive. At the beginning of the work on the MWB, the lexicographical workbench had to rely only on a small corpus of digital texts. This number was effectively increased by a collaborative project with the Electronic Text Center of the University of Virginia at Charlottesville (cf. Recker 2002 and MHDTA/MHGTA 2004). By now nearly all the texts of the Findebuch corpus have been digitized and added to the corpus of the new dictionary. 4. In 2006 the first instalment of the new MWB was published together with a CD-ROM containing a PDF of the printed form. The online version was to follow in 2007. All digital resources created for the new dictionary are firstly to the advantage of the lexicographer, and secondly and still more important to the advantage of the user as well. The lexicographer’s workbench provides an alphabetical list of headwords (Lemmaliste), it offers an easy and quick access to the old dictionaries (BMZ, Lexer, Findebuch), to the lemmatized archive of instances (Belegearchiv) in form of a KWIC-concordance, and finally to the whole corpus of etexts which can be searched for more evidences. Not only the lexicographer, but also the user of the online version has access to all information on the usage of a MHG word, as I will demonstrate in detail. As the new MWB is a long term project, the online version not only offers the already published instalments, but also the complete material the forthcoming fascicles are based upon. Furthermore, for the entries still to be worked out the old dictionaries are always present for consultation together with the new material of the archive and the corpus of etexts. Thus, linking old and new lexicographical resources _____________________________________________________________________________ 122 Digital Humanities 2008 _____________________________________________________________________________ proves immensely valuable not only for the workbench of the lexicographer, but also for the user of his work. Images Bibliography (Apart from the dictionaries, preferably research papers in English are listed) BMZ = Georg Friedrich Benecke, Wilhelm Müller, Friedrich Zarncke: Mittelhochdeutsches Wörterbuch. Leipzig 1854–1866. Nachdruck mit einem Vorwort und einem zusammengefaßten Quellenverzeichnis von Eberhard Nellmann sowie einem alphabetischen Index von Erwin Koller, Werner Wegstein und Norbert Richard Wolf. Stuttgart 1990. 
Burch, Thomas / Fournier, Johannes / Gärtner, Kurt / Rapp, Andrea (Eds.): Standards und Methoden der Volltextdigitalisierung. Beiträge des Internationalen Kolloquiums an der Universität Trier, 8./9. Oktober 2001 (Abhandlungen der Akademie der Wissenschaften und der Literatur, Mainz; Geistes- und sozialwissenschaftliche Klasse; Einzelveröffentlichung Nr. 9). Stuttgart 2002. Lexer’s MHG Dictionary within the Dictionary Net Burch, Thomas / Fournier, Johannes: Middle High German Dictionaries Interlinked. In: Burch/Fournier/Gärtner/Rapp 2002, p. 297-301. Online-version: http://germazope.uni-trier. de/Projects/MWV/publikationen DWB = Deutsches Wörterbuch der Brüder Grimm. Erstbearbeitung. 33 vols. Leipzig 1864-1971. Online-Version (OA): http://www.DWB.uni-trier.de Offline-Version: Der Digitale Grimm. Deutsches Wörterbuch der Brüder Grimm. Elektronische Ausgabe der Erstbearbeituung. Bearbeitet von Hans-Werner Bartz, Thomas Burch, Ruth Christmann, Kurt Gärtner,Vera Hildenbrandt, Thomas Schares und Klaudia Wegge. Hrsg. vom Kompetenzzentrum für elektronische Erschließungs- und Publikationsverfahren in den Geisteswissenschaften an der Universität Trier in Verbindung mit der Berlin-Brandenburgischen Akademie der Wissenschaften. 2 CD-ROMs, Benutzerhandbuch, Begleitbuch. 1. Auflage. Frankfurt am Main: Zweitausendeins 2004, 5. Auflage, 2007. The Online Version of the New MHG Dictionary Findebuch = Kurt Gärtner, Christoph Gerhardt, Jürgen Jaehrling, Ralf Plate, Walter Röll, Erika Timm (Datenverarbeitung: Gerhard Hanrieder): Findebuch zum mittelhochdeutschen Wortschatz. Mit einem rückläufigen Index. Stuttgart 1992. Fournier, Johannes: New Directions in Middle High German Lexicography: Dictionaries Interlinked Electronically. In: Literary and Linguistic Computing 16 (2001), p. 99-111. Gärtner, Kurt: Concordances and Indices to Middle High German. In: Computers and the Humanities 14 (1980), p. 39-45. Gärtner, Kurt: Comprehensive Digital Text Archives: A Digital Middle High German Text Archive and its Perspectives, in: First EU/NSF Digital Libraries All Projects Meeting, Rome March 25-26, 2002. http://delos-noe.iei.pi.cnr.it/activities/ internationalforum/All-Projects/us.html _____________________________________________________________________________ 123 Digital Humanities 2008 _____________________________________________________________________________ Gärtner, Kurt / Wisbey, Roy: Die Bedeutung des Computers für die Edition altdeutscher Texte. In: Kritische Bewahrung. Beiträge zur deutschen Philologie. Festschrift für Werner Schröder zum 60. Geburtstag. Hrsg. von E.-J. Schmidt. Berlin 1974, p. 344-356. Domestic Strife in Early Modern Europe: Images and Texts in a virtual anthology Gärtner, Kurt / Grubmüller, Klaus (Eds.): Ein neues Mittelhochdeutsches Wörterbuch: Prinzipien, Probeartikel, Diskussion (Nachrichten der Akademie der Wissenschaften in Göttingen. I. Philologisch-historische Klasse, Jg. 2000, Nr. 8). Göttingen 2000. Martin Holmes Gärtner, Kurt / Plate, Ralf / Recker, Ute: Textvorbereitung und Beleggewinnung für das Mittelhochdeutsche Wörterbuch. In: Literary and Linguistic Computing 14, No. 3 (1999), p. 417-423. Lexer = Lexer, Matthias: Mittelhochdeutsches Handwörterbuch. Nachdruck der Ausgabe Leipzig 1872-1878 mit einer Einleitung von Kurt Gärtner. 3 Bde. Stuttgart 1992. Lexer Nachträge = Matthias Lexer: Nachträge zum mittelhochdeutschen Handwörterbuch. Leipzig 1878. MHDTA/MHGTA = Final report on the Project „Digital Middle High German Text Archive“ to the DFG and NSF. 
Charlottesville / Trier 2004. http://www.mhgta.uni-trier. de/MHGTA_Final_Report.pdf MWB = Mittelhochdeutsches Wörterbuch Online-Version (OA): http://mhdwb-online.de Print-Version with CD: Mittelhochdeutsches Wörterbuch. Im Auftrag der Akademie der Wissenschaften und der Literatur Mainz und der Akademie der Wissenschaften zu Göttingen hrsg. von Kurt Gärtner, Klaus Grubmüller und Karl Stackmann. Erster Band, Lieferung 1ff. Stuttgart 2006ff. Project Homepage: http://www.mhdwb. uni-trier.de/ MWV = Mittelhochdeutsche Wörterbücher im Verbund. [Middle High German Dictionaries Interlinked] OnlineVersion (OA): http://www.MWV.uni-trier.de Offline-Version: Mittelhochdeutsche Wörterbücher im Verbund. Hrsg. von Thomas Burch, Johannes Fournier, Kurt Gärtner. CD-ROM und Begleitbuch. Stuttgart 2002. Queens, Frank; Recker-Hamm, Ute: A Net-based Toolkit for Collaborative Editing and Publishing of Dictionaries. In: Literary and Linguistic Computing 20 (2005), p. 165-175. Recker, Ute: Digital Middle High German Text Archive. In: Burch / Fournier / Gärtner / Rapp 2002, p. 307-309. Wisbey, Roy: Computer und Philologie in Vergangenheit, Gegenwart und Zukunft. In: Maschinelle Verarbeitung altdeutscher Texte IV. Beiträge zum Vierten Internationalen Symposion, Trier 28. Februar bis 2. März 1988. Hrsg. von Kurt Gärtner, Paul Sappler und Michael Trauth. Tübingen 1991, p. 346-361. Wörterbuch-Netz: http://www.woerterbuchnetz.de mholmes@uvic.ca University of Victoria, Canada Claire Carlin ccarlin@uvic.ca University of Victoria, Canada The seventeenth-century engravings and texts collected for the project Le mariage sous l’Ancien Régime: une anthologie virtuelle all belong to the polemical genre. The institution of marriage was undergoing intense scrutiny and criticism in light of Reformation questioning of Catholic practices, and popular discourse and images reflected this malaise. The cuckold attacked both verbally and physically by a nagging wife, or, conversely, a sassy wife receiving her “correction” are themes of the Middle Ages reproduced frequently during the early modern period, but with new sophistication as engravings became the primary site for the representation of problems with marriage, as DeJean has shown. Whereas polemical writings flourished in the first half of the century according to Banderier and Carlin 2002, images gained more complexity in design and incorporated increasing amounts of commentary during the reign of Louis XIV (1661-1715). New variations on medieval topics occur, as bourgeois women in salons comment on marriage or wrestle with their husbands over the pants in the family. A novel twist on spousal conflict appears in engravings such as “L’invention des femmes” (Lagniet) and “Operateur céphalique” (Anon.): inspired by the Renaissance interest in dissection, the notion that behaviour could be modified through radical brain surgery introduces a new kind of violence into marriage satire. From the beginning of the project, our approach to construction of the corpus has been based on the ideal that texts and images must both be full and equivalent members of the collection. Images have been treated as texts, and our goal has been to make explicit, in textual annotations, as much of the significant information they encode as possible. In this respect, the Mariage project is very different from other markup projects which integrate images and texts, such as those described in Porter 2007; while projects such as Pembroke 25 (Szarmach & Hall n.d.) 
and the Electronic Aelfric integrate images and text in sophisticated ways, the images are of manuscript pages, and the at the heart of these digital editions is the process of transcription. By contrast, while the engravings in the Mariage collection may include fairly extensive textual components — see, for instance, “Le Fardeau du Menage” (Guérard 1712), which incorporates over 60 lines of verse — the text is ancillary to the scenes depicted. Thus, the engravings in the Mariage project are somewhere on a continuum between the page-images of a digital edition of a manuscript, and the images _____________________________________________________________________________ 124 Digital Humanities 2008 _____________________________________________________________________________ of artworks in a project such as the On-Line Picasso Project (Mallen n.d.). At the simplest level, marking up the images means transcribing and clarifying any text which actually appears on the engravings. This makes the text on the images as searchable as that of the poetry and prose. Following this, we have begun to identify and discuss the significance of individual figures in the engravings, as well as linking these figures to similar elements in other engravings, or to texts in the collection. Figure 1 shows this kind of linking in as it appears to the reader in a Web browser. The wife’s lover is a recurrent figure in images and texts of the period, and the annotation links to textual references, and also to a segment of another image. Figure 2: Results of a search, showing sections of images as well as a poem retrieved from the database. We are now marking up symbols and devices, and have begun to contemplate the possibility that we might mark up virtually every object which appears in the engravings. Although this would be rather laborious, it would enable the discovery of more correspondences and perhaps more symbolism than is currently apparent.To be able to search for “poule”, and retrieve all the depictions of hens which appear in the collection (as well as instances of the word itself), and view them together on the same page, would provide a powerful tool for researchers.This level of annotation (the identification and labelling of everyday objects) does not require any significant expertise, but it will contribute a great deal to the value of the collection. Figure 1: Linking between image areas and other texts. In addition, the search system on the site has been designed to retrieve both blocks of text and segments of images, based on annotations to the images. Figure 2 shows part of the result set from a search for “frapp*”, including sections of images and a line from a poem. In the case of an image hit, clicking on the link will lead to the annotated image with the “hit annotation” selected. Using the combination of textual and image markup, our team has been able to uncover links across the decades as well as slowly developing differences among diverse types of polemical images. Research questions we have begun to explore include the following: - How does the use of key vocabulary evolve over the sixteenth to eighteenth centuries? - How does the use of objects, especially the instruments of violence, change over this period? - How does the use of stock verses and verse forms develop over the course of this period? 
In addition to discussing the research tools and insights emerging out of the Mariage project, this presentation will look at the development of the Image Markup Tool, an open- _____________________________________________________________________________ 125 Digital Humanities 2008 _____________________________________________________________________________ source Windows application which was created initially for use in the Mariage project, but which is now used in a range of other projects as well. The IMT has acquired a number of new features in response to the requirements of the Mariage project, and is also developing in reaction to changes in TEI P5. The original approach of the IMT was to blend SVG with TEI to produce a dual-namespace document (Carlin, Haswell and Holmes 2006), but the incorporation of the new <facsimile>, <surface> and <zone> elements in the TEI transcr module (TEI Consortium 2007, 11.1) now provide a native TEI framework which appears suited to image markup projects such as Mariage, and in December 2007, a new version of the Image Markup Tool was released, which relinquishes the SVG-based approach in favour of a pure TEI schema. This simplification of the file format will, we hope, make it much easier for endusers to integrate markup produced using the IMT into Web applications and other projects. Further developments in the IMT are projected over the next few months. The current version of the application is really focused on marking up the single, standalone images which are central to the Mariage project. However, <facsimile> is described in the TEI Guidelines as containing “a representation of some written source in the form of a set of images rather than as transcribed or encoded text” (TEI Consortium 2007, 11.1). The IMT really ought to be capable of producing documents which contain multiple images, and we plan to extend it to make this possible. References DeJean, J. 2003. “Violent Women and Violence against Women: Representing the ‘Strong’ Woman in Early Modern France.” Signs, 29.1 (2003): 117-47. Guérard, N. 1712. “Le Fardeau du Menage”. Bibliothèque Nationale de France Cote EE 3A PET FOL, P. 7 (Cliché 06019648). [http://mariage.uvic.ca/xhtml.xq?id=fardeau] Accessed 2007-11-05. Holmes, M. 2007. Image Markup Tool v. 1.7. [http://www.tapor. uvic.ca/~mholmes/image_markup/] Accessed 2007-11-05. Lagniet, J. ca. 1660. “L’Invention des femmes.” Bibliothèque Nationale de France, Cote TE 90 (1) FOL (Cliché 06019644). [http://mariage.uvic.ca/xhtml.xq?id=invention_femmes_1] Accessed 2007-11-05. Mallen, E., ed. n.d. The Picasso Project. Texas A&M University. [http://picasso.tamu.edu/] Accessed 2007-11-05. Porter, D. C. 2007. “Examples of Images in Text Editing.” Digital Humanities 2007 Conference Abstracts. 159-160. Szarmach, Paul, and Thomas H. Hall. n.d. Digital Edition of Cambridge, Pembroke College MS 25. (Pembroke 25 project.) Cited in Porter 2007. TEI Consortium, eds. 2007. “Digital Facsimiles.” Guidelines for Electronic Text Encoding and Interchange. [Last modified date: 2007-10-28]. [http://www.tei-c.org/release/doc/tei-p5- doc/ en/html/PH.html] Accessed 2007-11-07. Anon. n.d. “Operateur Cephalique”. Bibliothèque Nationale de France Cote RES TF 7 4e (Cliché P123-IFFNo190). [http:// mariage.uvic.ca/xhtml.xq?id=operateur_cephalique] Accessed 2007-11-05. Banderier, G. “Le mariage au miroir des poètes satiriques français (1600-1650).” In Le mariage dans l’Europe des XVIe et XVIIe siècles: réalités et représentations, vol. 2. Ed. R. Crescenzo et al. 
Presses de l’Université Nancy II, 200. 243-60.

Carlin, C. “Misogamie et misogynie dans les complaintes des mal mariés au XVIIe siècle.” In La Femme au XVIIe siècle. Ed. Richard G. Hodgson. Biblio 17, vol. 138. Tübingen: Gunter Narr Verlag, 2002. 365-78.

Carlin, C., E. Haswell and M. Holmes. 2006. “Problems with Marriage: Annotating Seventeenth-century French Engravings with TEI and SVG.” Digital Humanities 2006 Conference Abstracts. 39-42. [http://www.allc-ach2006.colloques.paris-sorbonne.fr/DHs.pdf]. July 2006. Accessed 2007-11-05.

Carlin, C. et al. 2005-2007. Le mariage sous l’Ancien Régime: une anthologie virtuelle. [http://mariage.uvic.ca/] Accessed 2007-11-05.

Rescuing old data: Case studies, tools and techniques

Martin Holmes
mholmes@uvic.ca
University of Victoria, Canada

Greg Newton
gregster@uvic.ca
University of Victoria, Canada

Much of the work done by digital humanists could arguably be described as “rescuing old data”. Digitizing texts is a way of rescuing them from obsolescence, from obscurity, or from physical degradation. However, digitizing in itself does not amount to permanent rescue; it is merely transformation, and digital texts have turned out to have lifetimes vastly shorter than printed texts. As Besser pointed out in 1999, “Though most people tend to think that (unlike analog information) digital information will last forever, we fail to realize the fragility of digital works. Many large bodies of digital information (such as significant parts of the Viking Mars mission) have been lost due to deterioration of the magnetic tapes that they reside on. But the problem of storage media deterioration pales in comparison with the problems of rapidly changing storage devices and changing file formats. It is almost impossible today to read files off of the 8-inch floppy disks that were popular just 20 years ago, and trying to decode Wordstar files from just a dozen years ago can be a nightmare.
Vast amounts of digital information from just 20 years ago is, for all practical purposes, lost.” As Kirschenbaum (2007) says, “The wholesale migration of literature to a born-digital state places our collective literary and cultural heritage at real risk.” To illustrate this point, consider that searches for the original news source of Besser’s claim about the Viking Mars mission reveal links to articles in Yahoo, Reuters, and Excite, but all of these documents are now unavailable. Only the Internet Archive has a copy of the story (Krolicki 2001). This paper will examine two cases in which we have recently had to rescue digitized content from near-death, and discuss some tools and techniques which we developed in the process, some of which are available for public use. Our experiences have taught us a great deal, not just about how to go about retrieving data from obsolete formats, but also about how better to protect the data we are generating today from obsolescence. Our paper is not intended to address the issue of digital preservation (how best to prevent good digital data from falling into obsolescence); an extensive literature already exists addressing this problem (Vogt-O’Connor 1999, Besser 1999, and others). The fact is that, despite our best efforts, bit-rot is inexorable, and software and data formats always tend towards obsolescence. We are concerned, here, with discussing the techniques we have found effective in rescuing data which has already become difficult to retrieve.

Case Study #1: The Nxaʔamxcín (Moses) Dictionary Database

During the 1960s and 70s, the late M. Dale Kincaid did fieldwork with native speakers of Nxaʔamxcín (Moses), an aboriginal Salish language. His fieldnotes were in the form of a huge set of index cards detailing vocabulary items, morphology and sample sentences. In the early 1990s, the data from the index cards was digitized using a system based on a combination of Lexware and WordPerfect, running on DOS, with the goal of compiling a print dictionary. The project stalled after a few years, although most of the data had been digitized; the data itself was left stored on an old desktop computer. When this computer finally refused to boot, rescuing the data became urgent, and the WordPerfect files were retrieved from its hard drive. We then had to decide what could usefully be done with it. Luckily, we had printouts of the dictionary entries as processed by the Lexware/WordPerfect system, so we knew what the original output was supposed to look like. The data itself was in WordPerfect files.

The natural approach to converting this data would be to open the files in a more recent version of WordPerfect, or failing that, a version contemporary with the files themselves. However, this was not effective, because the files themselves were unusual. In addition to the standard charset, at least two other charsets were used, along with an obsolete set of printer fonts which depended on a particular brand of printer and a specific Hercules graphics card. In addition, the original fonts were customized to add extra glyphs, and a range of macros were used. In fact, even the original authors were not able to see the correct characters on screen as they worked; they had to proof their work from printouts. When the files were opened in WordPerfect, unusual characters were visible, but they were not the “correct” characters, and even worse, some instances of distinct characters in the original were collapsed into identical representations in WordPerfect. Attempts to open the files in other word processors failed similarly. Another obvious approach would be to use libwpd, a C++ library designed for processing WordPerfect documents, which is used by OpenOffice.org and other word-processing programs. This would involve writing handlers for the events triggered during the document read process, analysing the context and producing the correct Unicode characters. Even given the fact that libwpd has only “Support for a substantial portion of the WordPerfect extended character set”, this technique might well have succeeded, but the tool created as a result would have been specific only to WordPerfect files, and to this project in particular. We decided that with a similar investment of time, we would be able to develop a more generally-useful tool; in particular, we wanted to create a tool which could be used by a non-programmer to do a similar task in the future.
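The core of such a tool is an ordered sequence of search-and-replace mappings that can be tested on sample files and then run over a whole batch. The following Python fragment is not the Transformer application itself; it is a minimal sketch, using hypothetical mappings from control-character sequences (in the “hat notation” discussed below) to Unicode, of why the ordering of the replacements matters when longer sequences contain shorter ones.

```python
# A minimal sketch of sequenced search-and-replace conversion.
# The mappings below are hypothetical, not the project's real tables:
# longer control sequences must be replaced before the shorter
# sequences they contain, so the list is applied strictly in order.
REPLACEMENTS = [
    ("M-@^D^GM-@M-@1^A", "\u0161"),   # hypothetical: maps to š
    ("M-@^D^G",          "\u02b7"),   # hypothetical: maps to ʷ
    ("M-@",              ""),         # strip remaining padding bytes
]

def convert(text: str) -> str:
    """Apply the ordered replacement table to one hat-notation string."""
    for pattern, replacement in REPLACEMENTS:
        text = text.replace(pattern, replacement)
    return text

def convert_files(paths: list[str]) -> None:
    """Run a tested sequence over a batch of files, writing UTF-8 output."""
    for path in paths:
        with open(path, encoding="latin-1") as src:
            converted = convert(src.read())
        with open(path + ".unicode.txt", "w", encoding="utf-8") as out:
            out.write(converted)

if __name__ == "__main__":
    print(convert("M-@^D^G M-@"))   # hypothetical input -> "ʷ "
```

Reordering the first two entries would let the shorter sequence consume part of the longer one and yield the wrong character, which is exactly the sequencing problem a debugging environment for such replacements has to expose.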
Comparing the contents of the files, viewed in a hex editor, to the printouts, we determined that the files consisted of:

- blocks of binary information we didn't need (WordPerfect file headers)
- blocks of recognizable text
- blocks of "encoded text", delimited by non-printing characters

The control characters signal switches between various WordPerfect character sets, enabling the printing of non-ASCII characters using the special printer fonts. Our task was to convert this data into Unicode. This was essentially an enormous search-and-replace project. Here is a sample section from the source data:

À[EOT][BEL]ÀÀ1[SOH]ÀkÀ[SO][EOT]À

(where [EOT] = end of transmission, [BEL] = bell, [SOH] = start of header, [SO] = shift out, and [US] = unit separator). This image shows the original print output from this data, including the rather cryptic labels such as "11tr" which were to be used by Lexware to generate the dictionary structure.

Working with the binary data, especially in the context of creating a search-and-replace tool, was problematic, so we transformed this into a pure text representation which we could work with in a Unicode text editor. This was done using the Linux "cat" command. The command "cat -v input_file > output_file" takes "input_file" and prints all characters, including non-printing characters, to "output_file", with what the cat manual refers to as "nonprinting characters" encoded in "hat notation". This took us from this:

À[EOT][BEL]ÀÀ1[SOH]ÀkÀ[SO][EOT]À

to this:

M-@^D^GM-@M-@1^AM-@kM-@^N^DM-@

From here, our task was in two stages: to go from the hat-notation data to a Unicode representation, and thence to a TEI XML representation.

In the first stage, we established a table of mappings between control character sequences and Unicode sequences. However, dozens of such sequences in our data contained overlaps; one sequence mapping to one character might appear as a component of a larger sequence mapping to a different character. In other words, if we were to reconstruct the data using search-and-replace operations, the order of those operations would be crucial; and in order to fix upon the optimal progression for the many operations involved, some kind of debugging environment would be needed. This gave rise to the Windows application Transformer, an open-source program designed for Unicode-based search-and-replace operations. It provides an environment for specifying, sequencing and testing multiple search-and-replace operations on a text file, and then allows the resulting sequence to be run against a batch of many files. This screenshot shows Transformer at work on one of the Moses data files.

The ability to test and re-test replacement sequences proved crucial, as we discovered that the original data was inconsistent. Data-entry practices had changed over the lifetime of the project. By testing against a range of files from different periods and different data-entry operators, we were able to devise a sequence which produced reliable results across the whole set, and transform all the data in one operation.

Having successfully created Unicode representations of the data, we were now able to consider how to convert the results to TEI XML. The dataset was in fact hierarchically structured, using a notation system which was designed to be processed by Lexware. First, we were able to transform it into XML using the Lexware Band2XML converter (http://www.ling.unt.edu/~montler/convert/Band2xml.htm); then an XSLT transformation took us to TEI P5. The XML is now being proofed and updated, and an XML database application is being developed.
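To make the shape of this workflow concrete, here is a minimal sketch, in Python, of an ordered, batch-applied replacement sequence of the kind just described. The control sequences and the Unicode characters they map to are invented placeholders, not the project's actual mapping tables, and the script stands in for Transformer's environment rather than reproducing it.

# A sketch of a sequenced search-and-replace pass over "cat -v" hat-notation files.
# The mappings are hypothetical examples; order matters, so they live in a list.
import re
from pathlib import Path

REPLACEMENTS = [
    (r"M-@\^D\^GM-@", "\u0295"),   # hypothetical: a longer control sequence first
    (r"M-@\^N\^DM-@", "\u02b7"),   # hypothetical: another multi-byte sequence
    (r"M-@", ""),                  # hypothetical: strip any remaining padding
]

def convert(text: str) -> str:
    """Apply every replacement, in the order given, to one hat-notation string."""
    for pattern, replacement in REPLACEMENTS:
        text = re.sub(pattern, replacement, text)
    return text

def convert_batch(src_dir: str, out_dir: str) -> None:
    """Run the whole tested sequence against a batch of files in one operation."""
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    for path in Path(src_dir).glob("*.txt"):
        (out / path.name).write_text(convert(path.read_text(encoding="utf-8")),
                                     encoding="utf-8")

if __name__ == "__main__":
    convert_batch("moses_hat_notation", "moses_unicode")

The essential design point, as in Transformer itself, is that the table is an ordered list rather than a dictionary: a longer control sequence must be consumed before any shorter sequence it contains, which is exactly the overlap problem that made a debugging environment necessary.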
Case Study #2: The Colonial Despatches project

During the 1980s and 1990s, a team of researchers at the University of Victoria, led by James Hendrickson, transcribed virtually the entire correspondence between the colonies of British Columbia and Vancouver Island and the Colonial Office in London, from the birth of the colonies until their incorporation into the Dominion of Canada. These documents include not only the despatches (the spelling with "e" was normal in the period) between colonial governors and the bureaucracy in London; each despatch received in London went through a process of successive annotation, in the form of bureaucratic "minutes", through which its significance was discussed, and appropriate responses or actions were mooted, normally leading to a decision or response by a government minister. These documents are held in archives in BC, Toronto and in the UK, and were transcribed both from originals and from microfilm. It would be difficult to overestimate the historical significance of this digital archive, and also its contemporary relevance to the treaty negotiations which are still going on between First Nations and the BC and Canadian governments.

The transcriptions take the form of about 9,000 text files in Waterloo SCRIPT, a markup language used primarily for preparing print documents. 28 volumes of printed text were generated from the original SCRIPT files, and several copies still exist. The scale of the archive, along with the multithreaded and intermittent nature of the correspondence, delayed as it was by the lengthy transmission times and the range of different government offices and agents who might be involved in any given issue, makes this material very well-suited to digital publication (and rather unwieldy and difficult to navigate in print form). Our challenge is converting the original SCRIPT files to P5 XML.

This is an example of the Waterloo SCRIPT used in this project:

We can see here the transcribed text, interspersed with structural milestones (.par;), editorial annotations such as index callouts and footnote links, and other commands such as .adr, which are actually user-defined macros that invoke sets of formatting instructions, but which are useful to us because they help identify and categorize information (addresses, dates, etc.). Although this is structured data, it is far from ideal. It is procedural rather than hierarchical (we can see where a paragraph begins, but we have to infer where it ends); it is only partially descriptive; and it mixes editorial content with transcription. This is rather more of a challenge than the Moses data; it is much more varied and more loosely structured.
Even if a generic converter for Waterloo SCRIPT files were available (we were unable to find any working converter which produced useful output such as XML), it would essentially do no more than produce styling/printing instructions in a different format; it would not be able to infer the document structure, or convert (say) a very varied range of handwritten date formats into formal date representations. To create useful TEI XML files, we need processing which is able to make complex inferences from the data, in order to determine, for instance, where an <opener> begins and ends; decide what constitutes a <salute>; or parse a human-readable text string such as "12719, CO 60/1, p. 207; received 14 December" into a structured reference in XML.

The most effective approach here was to write routines specific to these texts and the macros and commands used in them. As in the Moses project, we used the technique of stringing together a set of discrete operations, in an environment where they could be sequenced, and individual operations could be turned on and off, while viewing the results on individual texts. We gutted the original Transformer tool to create a shell within which routines programmed directly into the application were used in place of search-and-replace operations. The resulting interface provides the same options for sequencing and suppressing operations, and for running batch conversions on large numbers of files, but the underlying code for each operation is project-specific, and compiled into the application. This screenshot shows the application at work on a despatch file.
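As a rough illustration of what such an inference routine involves, the sketch below parses the kind of archival reference quoted above and wraps a bare salutation in markup. The element names, attribute names and regular expression are assumptions made for this example, not the project's actual conversion code or schema.

# A sketch of two project-specific inference routines of the kind described above.
import re
from xml.sax.saxutils import escape

REFERENCE = re.compile(
    r"(?P<docnum>\d+),\s*(?P<series>CO \d+/\d+),\s*p\.\s*(?P<page>\d+);\s*"
    r"received\s+(?P<received>.+)"
)

def reference_to_xml(text):
    """Turn e.g. '12719, CO 60/1, p. 207; received 14 December' into structured XML."""
    m = REFERENCE.match(text.strip())
    if m is None:
        return "<ref>%s</ref>" % escape(text)   # too idiosyncratic: leave for manual editing
    return '<ref docnum="%s" series="%s" page="%s" received="%s"/>' % (
        m.group("docnum"), escape(m.group("series")),
        m.group("page"), escape(m.group("received")))

def mark_salute(line):
    """Only context tells us that a lone 'Sir' opens a letter, so the test is kept narrow."""
    if line.strip().rstrip(",") == "Sir":
        return "<salute>%s</salute>" % escape(line.strip())
    return escape(line)

if __name__ == "__main__":
    print(reference_to_xml("12719, CO 60/1, p. 207; received 14 December"))
    print(mark_salute("Sir,"))

Routines like these only work because they are allowed to be narrowly project-specific; anything that fails the pattern is passed through untouched for manual attention.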
The project collection also contains some unencoded transcription in the form of Word 5 documents. To retrieve this data and other similar relics, we have built a virtual machine running Windows 3.11, and populated it with DOS versions of Word, WordStar and WordPerfect. This machine also provides an environment in which we can run the still-famous TACT suite of DOS applications for text-analysis. The virtual machine can be run under the free VMWare Server application, and we hope to make it available on CD-ROM at the presentation.

Conclusions

The data files in both the projects we have discussed above were all in standard, documented formats. Nevertheless, no generic conversion tool or process could have been used to transform this data into XML, while preserving all of the information inherent in it. We were faced with three core problems:

- Idiosyncrasy. Waterloo SCRIPT may be well documented, but any given corpus is likely to use macros written specifically for it. WordPerfect files may be documented, but issues with character sets and obsolete fonts can render them unreadable. Every project is unique.

- Inconsistency. Encoding practices evolve over time, and no project of any size will be absolutely consistent, or free of human error. The conversion process must be flexible enough to accommodate this; and we must also recognize that there is a point at which it becomes more efficient to give up, and fix the last few files or bugs manually. The bulk of the original SCRIPT files were converted in this way, at a "success rate" of around 98%, meaning that at least 98% of the files were converted to well-formed XML, and proved valid against the generic P5 schema used for the project. The remaining files (around 150) were partially converted, and then edited manually to bring them into compliance. Some of these files contained basic errors in their original encoding which precluded successful conversion. Others were too idiosyncratic and complex to be worth handling in code. For instance, 27 of the files contained "tables" (i.e. data laid out in tabular format), each with distinct formatting settings, tab width, and so on, designed to lay them out effectively on a page printed by a monospace printer. The specific formatting information was not representative of anything in the original documents (the original documents are handwritten, and very roughly laid out); rather, it was aimed specifically at the print process. In cases like this, it was more efficient simply to encode the tables manually.

- Non-explicit information. Much of the information we need to recover, and encode explicitly, is not present in a mechanical form in the original document. For example, only context can tell us that the word "Sir" constitutes a <salute>; this is evident to a human reader of the original encoding or its printed output, but not obvious to a conversion processor, unless highly specific instructions are written for it.

In Transformer, we have attempted to create a tool which can be used for many types of textual data (and we have since used it on many other projects). Rather than create a customized conversion tool for a single project, we have tried to create an environment for creating, testing and applying conversion scenarios. We are currently planning to add scripting support to the application. For the Colonial Despatches project, we resorted to customizing the application by adding specific application code to accomplish the conversion. A scripting feature in Transformer would have enabled us to write that code in script form, and combine it with conventional search-and-replace operations in a single process, without modifying the application itself.

Although this paper does not focus on digital preservation, it is worth noting that once data has been rescued, every effort should be made to encode and store it in such a way that it does not require rescuing again in the future. Good practices include use of standard file formats, accompanying documentation, regular migration as described in Besser (1999), and secure, redundant storage. To these we would add a recommendation to print out all your data; this may seem excessive, but if all else fails, the printouts will be essential, and they last a long time. Neither of the rescue projects described above would have been practical without access to the printed data. Dynamic rendering systems (such as Web sites that produce PDFs or HTML pages on demand, from database back-ends) should be able to output all the data in the form of static files which can be saved. The dynamic nature of such repositories is a great boon during development, and especially if they continue to grow and to be edited, but one day there may be no PHP or XSLT processor that can generate the output, and someone may be very glad to have those static files.
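A minimal sketch of that "static fallback" recommendation follows: walk every record identifier in a dynamic repository and save the rendered output as plain files. The URL pattern and the range of identifiers are hypothetical placeholders, not any particular project's interface.

# A sketch of dumping a dynamic repository to static files for long-term safekeeping.
from pathlib import Path
from urllib.request import urlopen

BASE_URL = "http://example.org/edition/record/%d.xml"   # hypothetical rendering URL
OUT = Path("static_dump")
OUT.mkdir(exist_ok=True)

for record_id in range(1, 9001):            # e.g. one file per transcribed despatch
    try:
        data = urlopen(BASE_URL % record_id, timeout=30).read()
    except OSError:
        continue                            # gaps in the identifier sequence
    (OUT / ("%05d.xml" % record_id)).write_bytes(data)

print("wrote", len(list(OUT.glob("*.xml"))), "static files")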
We would also recommend creating virtual machines for such complex systems; if your project depends on Tomcat, Cocoon and eXist, it will be difficult to run when there are no contemporary Java Virtual Machines. The Nxaʔamxcín (Moses) Dictionary Database. <http:// lettuce.tapor.uvic.ca/cocoon/projects/moses/> Vogt-O’Connor, Diane. 1999. “Is the Record of the 20th Century at Risk?” CRM: Cultural Resource Management 22, 2: 21–24. References Besser, Howard. 1999. “Digital longevity.” In Maxine Sitts (ed.) Handbook for Digital Projects: A Management Tool for Preservation and Access, Andover MA: Northeast Document Conservation Center, 2000, pages 155-166. Accessed at <http://www.gseis. ucla.edu/~howard/Papers/sfs-longevity.html>, 2007-11-09. Holmes, Martin. 2007. Transformer. <http://www.tapor.uvic. ca/~mholmes/transformer/> Hsu, Bob. n.d. Lexware. Kirschenbaum, Matthew. 2007. “Hamlet.doc? Literature in a Digital Age.” The Chronicle of Higher Education Review, August 17, 2007. Accessed at <http://chronicle.com/free/v53/i50/ 50b00801.htm>, 2007-11-09. Krolicki, Kevin. 2001. “NASA Data Point to Mars ‘Bugs,’ Scientist Says.” Yahoo News (from Reuters). Accessed at <http://web.archive.org/web/20010730040905/http:// dailynews.yahoo.com/h/nm/20010727/ sc/space_mars_life_ dc_1.html>, 2007-11-09. Hendrickson, James E. (editor). 1988. Colonial Despatches of British Columbia and Vancouver Island. University of Victoria. libwpd - a library for importing WordPerfect (tm) documents. <http://libwpd.sourceforge.net/> _____________________________________________________________________________ 131 Digital Humanities 2008 _____________________________________________________________________________ Digital Editions for Corpus Linguistics: A new approach to creating editions of historical manuscripts Alpo Honkapohja ahonkapo@welho.com University of Helsinki, Finland Samuli Kaislaniemi samuli.kaislaniemi@helsinki.fi University of Helsinki, Finland Ville Marttila Ville.Marttila@helsinki.fi University of Helsinki, Finland Introduction One relatively unexplored area in the expanding field of digital humanities is the interface between textual and contextual studies, namely linguistics and history. Up to now, few digital editions of historical texts have been designed with linguistic study in mind. Equally few linguistic corpora have been designed with the needs of historians in mind. This paper introduces a new interdisciplinary project, Digital Editions for Corpus Linguistics (DECL). The aim of the project is to create a model for new, linguistically oriented online digital editions of historical manuscripts. Digital editions, while on a theoretical level clearly more useful to scholars than print editions, have not yet become the norm for publishing historical texts. To some extent this can be argued to result from the lack of a proper publishing infrastructure and user-friendly tools (Robinson 2005), which limit the possibilities of individual scholars and small-scale projects to produce digital editions. The practical approach taken by the DECL project is to develop a detailed yet flexible framework for producing electronic editions of historical manuscript texts. Initially, the development of this framework will be based on and take place concurrently with the work on three digital editions of Late Middle and Early Modern English manuscript material. 
Each of these editions—a Late Medieval bilingual medical handbook, a family of 15thcentury culinary recipe collections, and a collection of early 17th-century intelligence letters—will serve both as a template for the encoding guidelines for that particular text type and as a development platform for the common toolset. Together, the toolset and the encoding guidelines are designed to enable editors of manuscript material to create digital editions of their material with reasonable ease. Theoretical background The theoretical basis of the project is dually grounded in the fields of manuscript studies and corpus linguistics. The aim of the project is to create a solid pipeline from the former to the latter: to facilitate the representation of manuscript reality in a form that is amenable to corpus linguistic study. Since context and different kinds of metatextual features are important sources of information for such fields as historical pragmatics, sociolinguistics and discourse analysis, the focus must be equally on document, text and context. By document we refer to the actual manuscript, by text to the linguistic contents of the document, and by context to both the historical and linguistic circumstances relating to text and the document. In practice this division of focus means that all of these levels are considered equally important facets of the manuscript reality and are thus to be represented in the edition. On the level of the text, DECL adopts the opinion of Lass (2004), that a digital edition should preserve the text as accurately and faithfully as possible, convey it in as flexible a form as possible, and ensure that any editorial intervention remains visible and reversible.We also adopt a similar approach with respect to the document and its context: the editorial framework should enable and facilitate the accurate encoding and presentation of both the diplomatic and bibliographical features of the document, and the cultural, situational and textual contexts of both the document and the text. In keeping with the aforementioned aims, the development of both the editions, and the tools and guidelines for producing them, will be guided by the following three principles. Flexibility The editions seek to offer a flexible user interface that will be easy to use and enable working with various levels of the texts, as well as selecting the features of the text, document and context that are to be included in the presentation or analysis of the text. All editions produced within the framework will build on similar logic and general principles, which will be flexible enough to accommodate the specific needs of any text type. Transparency The user interfaces of the editions will include all the features that have become expected in digital editions. But in addition to the edited text and facsimile images of the manuscripts, the user will also be able to access the raw transcripts and all layers of annotation. This makes all editorial intervention transparent and reversible, and enables the user to evaluate any editorial decisions. 
Expandability The editions will be built with future expansion and updating in mind.This expandability will be three-dimensional in the sense that new editions can be added and linked to existing ones, and both new documents and new layers of annotation or _____________________________________________________________________________ 132 Digital Humanities 2008 _____________________________________________________________________________ information can be added to existing editions. Furthermore, the editions will not be hardwired to a particular software solution, and their texts can be freely downloaded and processed for analysis with external software tools. The editions will be maintained on a web server and will be compatible with all standards-compliant web browsers. The first DECL editions are being compiled at the Research Unit for Variation, Contacts and Change in English (VARIENG) at the University of Helsinki, and will form the bases for three doctoral dissertations. These editions, along with a working toolset and guidelines, are scheduled to be available within the next five years. Technical methods Since the aim of the DECL project is to produce an open access model for multipurpose and multidisciplinary digital editions, both the editions created by the DECL project and the tools and guidelines used in their production will be published online under an open access license. While the project strongly advocates open access publication of scholarly work, it also acknowledges that this may not be possible due to ongoing issues with copyright, for example in the case of facsimile images. Following the aforementioned principles, the electronic editions produced by the project will reproduce the features of the manuscript text as a faithful diplomatic transcription, into which linguistic, palaeographic and codicological features will be encoded, together with associated explanatory notes elucidating the contents and various contextual aspects of the text.The encoding standard used for the editions will be based on and compliant with the latest incarnation of the TEI XML standard (P5, published 1.11.2007), with any text-type specific features incorporated as additional modules to the TEI schema. The XML-based encoding will enable the editions to be used with any XML-aware tools and easily converted to other document or database standards. In addition to the annotation describing the properties of the document, text and context, further layers of annotation—e.g. linguistic analysis—can be added to the text later on utilising the provisions made in the TEI P5 standard for standoff XML markup. The editorial strategies and annotation practices of the three initial editions will be carefully coordinated and documented to produce detailed guidelines, enabling the production of further compatible electronic editions. The tools developed concurrently with and tested on the editions themselves will make use of existing open source models and software projects—such as GATE or Heart of Gold, teiPublisher and Xaira—to make up a sophisticated yet customisable annotation and delivery system.The TEI-based encoding standard will also be compatible with the ISO/TC 37/SC 4 standard, facilitating the linguistic annotation of the text. Expected results One premise of this project is that creating digital editions based on diplomatic principles will help raise the usefulness of digitised historical texts by broadening their scope and therefore also interest in them. 
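As a simplified illustration of that premise (and not of the DECL encoding guidelines themselves), the sketch below shows how a single diplomatic TEI-style transcription, here a tiny invented recipe fragment using the standard TEI choice/abbr/expan mechanism, can yield both a diplomatic reading and a normalised token stream for corpus tools.

# One diplomatic transcription, two readings: diplomatic and normalised.
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"

SAMPLE = """
<p xmlns="http://www.tei-c.org/ns/1.0">
  Take <choice><abbr>ye</abbr><expan>the</expan></choice> rote of
  <choice><abbr>percely</abbr><expan>parsley</expan></choice> and grynde it smal.
</p>
"""

def render(elem, prefer):
    """Walk the tree, keeping either the abbreviated or the expanded reading."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == TEI + "choice":
            parts.append(child.find(TEI + prefer).text or "")
        else:
            parts.append(render(child, prefer))
        parts.append(child.tail or "")
    return "".join(parts)

if __name__ == "__main__":
    p = ET.fromstring(SAMPLE.strip())
    print("diplomatic:", " ".join(render(p, "abbr").split()))
    print("normalised:", " ".join(render(p, "expan").split()))

Because the editorial layer is kept distinct in the markup, either reading can be produced, and reversed, without touching the transcription itself.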
Faithful reproduction of the source text is a requisite for historical corpus linguistics, but editions based on diplomatic transcripts of manuscript sources are equally amenable to historical or literary enquiry. Combining the approaches of different disciplines—historical linguistics, corpus linguistics, history—to creating electronic text databases should lead to better tools for all disciplines involved and increase interdisciplinary communication and cooperation. If they prove to be successful, the tools and guidelines developed by DECL will also be readily applicable to the editing and publication of other types of material, providing a model for integrating the requirements and desires of different disciplines into a single solution. The DECL project is also intended to be open in the sense that participation or collaboration by scholars or projects working on historical manuscript materials is gladly welcomed. References DECL (Digital Editions for Corpus Linguistics). <http://www. helsinki.fi/varieng/domains/DECL.html>. GATE (A General Architecture for Text Engineering). <http:// gate.ac.uk>. Accessed 15 November 2007. Heart of Gold. <http://www.delph-in.net/heartofgold/>. Accessed 23 November 2007. ISO/TC 37/SC 4 (Language resource management). <http://www.iso.org/iso/iso_technical_committee. html?commid=297592>. Accessed 15 November 2007. Lass, Roger. 2004. “Ut custodiant litteras: Editions, Corpora and Witnesshood”. Methods and Data in English Historical Dialectology, ed. Marina Dossena and Roger Lass. Bern: Peter Lang, 21–48. [Linguistic Insights 16]. Robinson, Peter. 2005. “Current issues in making digital editions of medieval texts—or, do electronic scholarly editions have a future?”. Digital Medievalist 1:1 (Spring 2005). <http://www.digitalmedievalist.org>. Accessed 6 September 2006. TEI (Text Encoding Initiative). <http:/www.tei-c.org>. Accessed 15 November 2007. teiPublisher. <http://teipublisher.sourceforge.net/docs/index. php>. Accessed 15 November 2007. VARIENG (Research Unit for Variation, Contacts and Change in English). <http://www.helsinki.fi/varieng/>. Xaira (XML Aware Indexing and Retrieval Architecture). <http://www.oucs.ox.ac.uk/rts/xaira/>. Accessed 15 November 2007. _____________________________________________________________________________ 133 Digital Humanities 2008 _____________________________________________________________________________ The Moonstone and The Coquette: Narrative and Epistolary Styles David L. Hoover david.hoover@nyu.edu New York University, USA In his seminal work on Austen, John F. Burrows demonstrates that characters can be distinguished from each another on the basis of the frequencies of the most frequent words of their dialogue treating the characters as if they were authors (1987). Computational stylistics has also been used to study the distinctive interior monologues of Joyce’s characters in Ulysses (McKenna and Antonia 1996), the styles of Charles Brockden Brown’s narrators (Stewart 2003), style variation within a novel (Hoover 2003), and the interaction of character separation and translation (Rybicki 2006). Here I address two kinds of local style variation: the multiple narrators of Wilkie Collins’s The Moonstone (1868) and the multiple letter writers in Hannah Webster Foster’s sentimental American epistolary novel The Coquette (1797). The Moonstone has several narrators whose styles seem intuitively distinct, though all the narrations share plot elements, characters, and physical and cultural settings. 
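A minimal sketch of the most-frequent-word profiling and Delta comparison on which such computational-stylistic analyses rest is given below; the dummy texts, section lengths and word-list size are illustrative stand-ins, not the corpora or settings used in this study.

# A sketch of most-frequent-word profiling and Burrows's Delta.
import statistics
from collections import Counter

def mfw_profiles(sections, n_words=300):
    """Relative frequencies, per section, of the n most frequent words overall."""
    counts = [Counter(text.lower().split()) for text in sections]
    totals = [sum(c.values()) for c in counts]
    overall = Counter()
    for c in counts:
        overall.update(c)
    vocab = [word for word, _ in overall.most_common(n_words)]
    return [[c[w] / t for w in vocab] for c, t in zip(counts, totals)]

def burrows_delta(profile_a, profile_b, all_profiles):
    """Mean absolute difference of z-scored word frequencies between two profiles."""
    distances = []
    for i in range(len(profile_a)):
        column = [p[i] for p in all_profiles]
        mean = statistics.mean(column)
        sd = statistics.pstdev(column) or 1.0   # guard against zero variance
        distances.append(abs((profile_a[i] - mean) / sd - (profile_b[i] - mean) / sd))
    return statistics.mean(distances)

if __name__ == "__main__":
    sections = [
        "the first narrator tells the story in his own words and in his own way",
        "the second narrator writes letters to a friend in quite another voice",
        "the third narrator keeps to the plain facts of the case as he saw them",
    ]
    profiles = mfw_profiles(sections, n_words=20)
    print(round(burrows_delta(profiles[0], profiles[1], profiles), 3))

In the analyses reported here, the sections are the narrators' and letter writers' texts cut into roughly equal-sized blocks, and the resulting distances feed the cluster analyses.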
The Coquette, based on an infamous true story, and very popular when it was published, is read today more for its cultural significance and its proto-feminist tendencies than for its literary merit. Nevertheless, one would expect the coquette, the evil seducer, the virtuous friend, and the disappointed suitor to write distinctively. Treating the narrators and letter writers of these two novels as different authors will test how successfully Collins and Foster distinguish their voices and shed light on some practical and theoretical issues of authorship and style. novels confirm that Collins is easily distinguished from five of his contemporaries, as shown in Fig. I. For The Coquette, I have made the task more difficult by comparing Foster’s letter writers to those of five other late 18th-century epistolary novels by five authors. I separated all the letters by writer and addressee and retaining only the 22 writers with the most letters, then combined the letters between each single writer and addressee and cut the combined texts into 42 sections of about 3,500 words. Both Delta and cluster analysis do a good job on this difficult task, and many Delta analyses give completely correct results. Cluster analysis is slightly less accurate, but several analyses are correct for five of the six authors; they also show that individual letter writers strongly tend to group together within the cluster for each novel (see Fig. 2). Examining the individual novels reveals a sharp contrast: Collins’s narrators are internally consistent and easy to distinguish, while Foster’s letter writers are much less internally consistent and much more difficult to distinguish. For Collins, cluster analysis consistently groups all narrative sections of 6 of the 7 narrators. When a modified technique developed especially for investigating intra-textual style variation is used (Hoover 2003), the results are even better. As Fig. 3 shows, all sections by all narrators sometimes cluster correctly (the sections range from 4,300 to 6,900 words). “Tuning” the analysis to produce better clustering may seem circular in an analysis that tests whether Collins’s narrators have consistent idiolects. But this objection can be answered by noting the consistency of the groupings. The stable clustering across large ranges of analyses with different numbers of MFW that is found here is obviously more significant than frequently-changing clustering. Computational stylistics cannot be applied to narrators and letter writers of these novels, however, unless they can distinguish Collins and Foster from their contemporaries. Experiments on a 10-million word corpus of forty-six Victorian Note also that the sections strongly tend to occur in narrative order in Fig. 3: every two-section cluster consists of contiguous sections. This “echo” of narrative structure provides further evidence that the analysis is accurately characterizing the narrators’ styles. _____________________________________________________________________________ 134 Digital Humanities 2008 _____________________________________________________________________________ could argue that Boyer’s letters to Selby should be different from his letters to Eliza, and perhaps it is appropriate that Eliza’s last section includes only despairing letters written after Boyer has rejected her.Yet such special pleading is almost always possible after the fact. 
Further, any suggestion that Boyer’s letters to Eliza should be distinct from those to Selby is impossible to reconcile with the fact that they cluster so consistently with Eliza’s early letters to Lucy. And if Eliza’s early and late letters should be distinct, it is difficult understand the clustering of the early letters with those of Boyer to Selby and the consistent clustering of the late letters with Lucy’s letters to Eliza. It is difficult to avoid the conclusion that Foster has simply failed to create distinct and consistent characters in The Coquette. A fine-grained investigation of The Moonstone involving smaller sections and including more narrators puts Collins’s ability to differentiate his narrators to a sterner test, but some cluster analyses are again completely correct.The distinctiveness of the narrators of The Moonstone is thus confirmed by computational stylistics: the narrators behave very much as if they were literally different authors. Given the length of the novel, the generic constraints of narrative, and the inherent similarities of plot and setting, this is a remarkable achievement. For my analysis of The Coquette, I have added 3 more sections of letters, so that there are 13 sections of approximately 3,000 words by 6 writers, with separate sections by a single writer to two different addressees. Although these sections should be considerably easier to group than those in the fine-grained analysis of The Moonstone, the results are not encouraging. Cluster analyses based on the 300-700 MFW produce very similar but not very accurate results, in all of which letters by Boyer and Eliza appear in both main clusters (see Fig. 4). Lucy’s letters to Eliza also cluster with the last section of Eliza’s letters to Lucy in all of these analyses a misidentification as strong as any correct one. In a real authorship attribution problem, such results would not support the conclusion that all of Boyer’s or Eliza’s letters were written by a single author. Perhaps one In contrast, Fanny Burney, whose Evelina is included in the novels compared with The Coquette, above, creates very distinct voices for the letter writers. Although only four of the writers have parts large enough for reasonable analysis, Evelina writes to both her adoptive father Mr. Villars and her friend Miss Mirvan, and Villars writes both to Evelina and to Lady Howard. One might expect significant differences between letters to these very different addressees. Evelina’s style might also be expected to change over the course of this bildungsroman. However, analyses of all 34 sections of letters from Evelina (approximately 2,500 words long), show that Burney’s characters are much more distinct and consistent than Foster’s, as a representative analysis shows (see Fig. 5). This dendogram also strongly reflects the narrative structure of the novel. Burney, like Collins, is very successful in creating distinctive voices for her characters. Criticism of Foster’s novel has paid little attention to the different voices of the characters, but what commentary there is does not suggest that the patterns shown above should have been predictable. Smith-Rosenberg, for example, suggests a contrast between Eliza and “the feminized Greek chorus of Richman, Freeman, and Eliza’s widowed mother, who, at the end, can only mouth hollow platitudes” (2003: 35). 
Although these women and Julia are often considered a monolithic group urging conventional morality, the distinctness of Julia’s _____________________________________________________________________________ 135 Digital Humanities 2008 _____________________________________________________________________________ sections, especially from Lucy’s (see Fig. 5) might suggest a reexamination of this notion, and might reveal how the styles of Julia and the other women are related to more significant differences of opinion, character, or principle. Various suggestions about changes in Eliza over the course of the novel might also benefit from a closer investigation of the language of the letters. Because Foster’s novel is of interest chiefly on cultural, historical, and political grounds rather than literary ones, however, such an investigation is more likely to advance the theory and practice of computational stylistics than the criticism of The Coquette. It is clear, at any rate, that computational stylistics is adequate to the task of distinguishing narrators and letter writers, so long as the author is adequate to the same task. References Burrows, J. (1987) Computation into Criticism. Oxford: Clarendon Press. Hoover, D. (2003) Multivariate Analysis and the Study of Style Variation’, LLC 18: 341-60. Smith-Rosenberg, C. (2003) Domesticating Virtue: Coquettes and Revolutionaries in Young America’, In M. Elliott and C. Stokes (eds),American Literary Studies: A Methodological Reader, New York: NewYork Univ. Press. Rybicki, J. (2006). “Burrowing into Translation: Character Idiolects in Henryk Sienkiewicz’s Trilogy and its Two English Translations,” LLC 21:91-103. Stewart, L.”Charles Brockden Brown: Quantitative Analysis and Literary Interpretation,” LLC 2003 18: 129-38. Term Discovery in an Early Modern Latin Scientific Corpus Malcolm D. Hyman hyman@mpiwg-berlin.mpg.de Max Planck Institut für Wissenschaftsgeschichte, Germany This paper presents the results of a pilot project aimed at the development of automatic techniques for the discovery of salient technical terminology in a corpus of Latin texts. These texts belong to the domain of mechanics and date from the last quarter of the sixteenth and the first quarter of the seventeenth century, a period of intense intellectual activity in which engineers and scientists explored the limits of the Aristotelian and Archimedean paradigms in mechanics. The tensions that arose ultimately were resolved by the new “classical mechanics” inaugurated by Newton’s Principia in 1687 (cf. Damerow et al. 2004). The work presented here forms part of a larger research project aimed at developing new computational techniques to assist historians in studying fine-grained developments in longterm intellectual traditions, such as the tradition of Western mechanics that begins with the pseudo-Aristotelian Problemata Mechanica (ca. 330 B.C.E.).This research is integrated with two larger institutional projects: the working group “Mental Models in the History of Mechanics” at the Max Planck Institute for the History of Science in Berlin, and the German DFG-funded Collaborative Research Center (CRC) 644 “Transformations of Antiquity.” The purpose of this paper is to present initial results regarding the development of efficient methods for technical term discovery in Early Modern scientific Latin. The focus is on the identification of term variants and on term enrichment. 
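By way of a deliberately naive illustration of what conflating term variants means in a richly inflected language, the sketch below groups candidate multi-word terms whose crudely stemmed forms coincide. The suffix list and sample phrases are assumptions for the example only; the methodology actually employed, described next, follows Jacquemin and does not reduce to suffix stripping.

# A naive sketch of conflating inflectional variants of candidate Latin terms.
from collections import defaultdict

SUFFIXES = sorted(["ibus", "orum", "arum", "em", "es", "is", "ae", "am", "um", "us",
                   "o", "a", "e", "i"], key=len, reverse=True)

def crude_stem(word):
    """Strip one common ending; far short of real Latin morphology, but illustrative."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def conflate(candidates):
    """Group multi-word candidates whose crudely stemmed forms coincide."""
    groups = defaultdict(list)
    for phrase in candidates:
        key = " ".join(crude_stem(w) for w in phrase.lower().split())
        groups[key].append(phrase)
    return groups

if __name__ == "__main__":
    for key, forms in conflate(["centrum gravitatis", "centro gravitatis",
                                "centri gravitatis"]).items():
        print(key, "->", forms)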
The methodology employed is inspired by Jacquemin (2001), whose approach allows for the use of natural language processing techniques without the need for full syntactic parsing, which is currently not technically feasible for Latin. The present paper extends prior work in term discovery along two vectors. First, work in term discovery has primarily addressed the languages of Western Europe (especially English and French), with some work also in Chinese and Japanese. Latin presents some typological features that require modifications to established techniques. Chief among these is the rich inflectional morphology (both nominal and verbal) of Latin, which is a language of the synthetic type. Latin also exhibits non-projectivity, i.e. syntactic constituents may be represented non-continuously (with the intrusion of elements from foreign constituents). Although the non-projectivity of Renaissance Latin is considerably less than what is found in the artistic prose (and a fortiori poetry) of the Classical language (Bamman and Crane 2006), term detection must proceed within a framework that allows for both non-projectivity and (relatively) free word order within constituents. _____________________________________________________________________________ 136 Digital Humanities 2008 _____________________________________________________________________________ Second, researchers in the field of term discovery have focused almost exclusively on contemporary scientific corpora in domains such as biomedicine. In contemporary scientific literature, technical terms are characterized by a particularly high degree of denotative monosemicity, exhibit considerable stability, and follow quite rigid morphological, syntactic, and semantic templates. Although these characteristics are also applicable to the terminology of Latin scientific texts, they are applicable to a lesser degree. In other words, the distinction between technical terminology and ordinary language vocabulary is less clear cut than in the case of contemporary scientific and technical language. The lesser degree of monosemicity, stability, and structural rigidity of terminology holds implications for automatic term discovery in corpora earlier than the twentieth (or at least nineteenth) century. The corpus of Early Modern mechanics texts in Latin is welldesigned for carrying out experiments in adapting established techniques of term discovery to historical corpora. Mechanics is by this time a scientific discipline that possesses an extensive repertoire of characteristic concepts and terminology. Thus it is broadly comparable to contemporary scientific corpora, while still presenting unique features that merit special investigation. Several thousand pages of text are available in XML format, which have been digitized by the Archimedes Project, an international German/American digital library venture jointly funded by the DFG and NSF. It will be possible to extend future work to a multilingual context, by examining in addition closely-related vernacular works (in Italian, Spanish, and German) that are contemporary with the Latin corpus. (Some of these are translations and commentaries.) References Bamman, D. and G. Crane. 2006. The design and use of a Latin dependency treebank. In Proceedings of the TLT 2006, edd. J. Hajič and J. Nivre, Prague, pp. 67–78. Damerow, P., G. Freudenthal, P. McLaughlin, J. Renn, eds. 2004. 
Exploring the Limits of Preclassical Mechanics: A Study of Conceptual Development in Early Modern Science: Free Fall and Compounded Motion in the Work of Descartes, Galileo, and Beeckman. 2d ed. New York. Hyman, M.D. 2007. Semantic networks: a tool for investigating conceptual change and knowledge transfer in the history of science. In Übersetzung und Transformation, edd. H. Böhme, C. Rapp, and W. Rösler, Berlin, pp. 355–367. Jacquemin, C. 2001. Spotting and Discovering Terms through Natural Language Processing. Cambridge, MA. The set of technical terminology discovered by the methods presented in this paper is intended to further the computationally-assisted framework for exploring conceptual change and knowledge transfer in the history of science that has been described by Hyman (2007). This framework employs latent semantic analysis (LSA) and techniques for the visualization of semantic networks, allowing change in the semantic associations of terms to be studied within a historical corpus.The concluding section of the present paper will survey the applications of technical term discovery within historical corpora for the study of the conflict, competition, evolution, and replacement of concepts within a scientific discipline and will suggest potential applications for other scholars who are concerned with related problems. _____________________________________________________________________________ 137 Digital Humanities 2008 _____________________________________________________________________________ Markup in Textgrid Fotis Jannidis jannidis@linglit.tu-darmstadt.de Technische Universität Darmstadt, Germany Thorsten Vitt vitt@linglit.tu-darmstadt.de Technische Universität Darmstadt, Germany The paper will discuss the decisions in relation to markup which have been made in Textgrid. The first part of the paper will describe the functionality and principal architecture of Textgrid, the second part will discuss Textgrid’s baseline encoding. Textgrid is a modular platform for collaborative textual editing and a first building block for a community grid for the humanities. Textgrid consists of a toolkit for creating and working with digital editions and a repository offering storage, archiving and retrieval. Textgrid’s architecture follows a layered design, built for openness on all levels. At its base there is a middleware layer providing generic utilities to encapsulate and provide access to the data grid’s storage facilities as well as external archives. Additionally, indexing and retrieval facilities and generic services like authorisation and authentication are provided here. A service layer built on the middleware provides automated text processing facilities and access to semantic resources. Here, Textgrid offers domain-specific services like a configurable streaming editor or a lemmatizer which uses the dictionaries stored in Textgrid.All services can be orchestrated in workflows, which may also include external services. Every service deploys standard web service technologies. As well, tools in the service layer can work with both data managed by the middleware and data streamed in and out of these services by the caller, so they can be integrated with environments outside of Textgrid. The full tool suite of Textgrid is accessible via TextGridLab, a user interface based on Eclipse which, besides user interfaces to the services and search and management facilities for Textgrid’s content, also includes some primarily interactive tools. 
The user interface provides integrated access to the various tools: For example, an XML Editor, a tool to mark up parts of an image and link it to the text, and a dictionary service. From the perspective of the user, all these tools are part of one application. This software framework is completely based on plug-ins and thus reflects the other layers’ extensibility: it can be easily extended by plug-ins provided by third parties, and although there is a standalone executable tailored for the philologist users, TextGridLab’s plugins can be integrated with existing Eclipse installations, as well. Additionally, the general public may read and search publicized material by means of a web interface, without installing any specialized software. Designing this infrastructure it would have been a possibility to define one data format which can be used in all services including search and retrieval and publishing. Instead the designers chose a different approach: each service or software component defines its own minimal level of format restriction. The XML editor, which is part of the front end, is designed to process all files which are xml conform; the streaming editor service can handle any kind of file etc. The main reason for this decision was the experience of those people involved and the model of the TEI guidelines to allow users as much individual freedom to choose and use their markup as possible even if the success of TEI lite and the many project specific TEI subsets seem to point to the need for defining strict standards. But at some points of the project more restrictive format decisions had to be made. One of them was the result of the project’s ambition to make all texts searchable in a way which is more useful than a mere full text search. On the other hand it isn’t realistic to propose a full format which will allow all future editors, lexicographers and corpus designers to encode all features they are interested in. So Textgrid allows all projects to use whatever XML markup seems necessary but burdens the project with designing its own interface to these complex data structures. But in this form the project data are an island and there is no common retrieval possible.To allow a retrieval across all data in Textgrid which goes beyond the possibilities of a full text research, the Textgrid designers discussed several possibilities but finally settled down on a concept which relies very much on text types like drama, prose, verse, letter etc. and we differentiate between basic text types like verse and container text types like corpora or critical editions. Interproject search is enabled by transforming all texts into a rudimentary format which contains the most important information of the specific text type. This baseline encoding is not meant to be a core encoding which covers all important information of a text type but it is strictly functional. We defined three demands which should be met by the baseline encoding, which is meant to be a subset of the TEI: 1) Intelligent search. Including often used aspects of text types into the search we try to make text retrieval more effective. A typical example would be the ‘knowledge’ that a word is the lemma of a dictionary entry, so a search for this word would mark this subtree as a better hit than another where it is just part of a paragraph. 2) Representation of search results. The results of an interproject search have to be displayed in some manner which preserves some important aspects of the source texts. 
3) Automatic reuse and further processing of text. A typical example for this would be the integration of a dictionary in a network of dictionaries. This aspect is notoriously _____________________________________________________________________________ 138 Digital Humanities 2008 _____________________________________________________________________________ underrepresented in most design decisions of modern online editions which usually see the publication as the natural goal of their project, a publication which usually only allows for reading and searching as the typical forms of text usage. Our paper will describe the baseline encoding format for some of the text types supported by Textgrid at the moment including the metadata format and discuss in what ways the three requirements are met by them. One of the aims of our paper is to put our arguments and design decisions up for discussion in order to test their validity. Another aim is to reflect on the consequences of this approach for others like the TEI, especially the idea to define important text types for the humanities and provide specific markup for them. Breaking down barriers: the integration of research data, notes and referencing in a Web 2.0 academic framework Ian R. Johnson johnson@acl.arts.usyd.edu.au University of Sydney, Australia In this paper I argue that the arbitrary distinction between bibliographic data and research data – which we see in the existence of specialised library catalogues and bibliographic systems on the one hand, and a multitude of ad hoc notes, digitized sources, research databases and repositories on the other – is a hangover from a simpler past, in which publication and bibliographic referencing was a well-defined and separate part of the research cycle. Today published material takes many different forms, from books to multimedia, digital artworks and performances. Research data results from collection and encoding of information in museums, archives, libraries and fieldwork, as well as output from analysis and interpretation. As new forms of digital publication appear, the boundary between published material and research data blurs. Given the right enabling structures (eg. peer review) and tools (eg. collaborative editing), a simple digitized dataset can become as valuable as any formal publication through the accretion of scholarship. When someone publishes their academic musings in a personal research blog, is it analogous to written notes on their desk (or desktop) or is it grey literature or vanity publishing? The drawing of a distinction between bibliographic references and other forms of research data, and the storing of these data in distinct systems, hinders construction of the linkages between information which lie at the core of Humanities research.Why on earth would we want to keep our bibliographic references separate from our notes and developing ideas, or from data we might collect from published or unpublished sources? Yet standalone desktop silos (such as EndNote for references, Word for notes and MSAccess for data) actively discourage the linking of these forms of information. Bespoke or ad hoc databases admirably (or less than admirably) fulfill the particular needs of researchers, but fail to connect with the wider world. 
These databases are often desktop-based and inaccessible to anyone but the user of their host computer, other than through sharing of copies (with all the attendant problems of redundancy, maintenance of currency and merging of changes). When accessible they often lack multiuser capabilities and/or are locked down to modification by a small group of users because of the difficulties of monitoring and rolling back erroneous or hostile changes. Even when accessible to the public, they are generally accessible through a web interface which allows human access but not machine access, and cannot therefore be linked programmatically with other data to create an integrated system for analyzing larger problems.

For eResearch in the Humanities to advance, all the digital information we use – bibliographic references, personal notes, digitized sources, databases of research objects etc. – needs to exist in a single, integrated environment rather than in separate incompatible systems. This does not of course mean that the system need be monolithic – mashups, portals and Virtual Research Environments all offer distributed alternatives, dependent on exposure of resources through feed and web services. The 'silo' approach to data is also breaking down with the stunning success of web-based social software such as the Wikipedia encyclopaedia or Del.icio.us social bookmarking systems. These systems demonstrate that – with the right level of control and peer review – it is possible to build substantial and highly usable databases without the costs normally associated with such resources, by harnessing the collaborative enthusiasm of large numbers of people for data collection and through data mining of collective behaviour.

To illustrate the potential of an integrated Web 2.0 approach to heterogeneous information, I will discuss Heurist (HeuristScholar.org) – an academic social bookmarking application which we have developed, which provides rich information handling in a single integrated web application – and demonstrate the way in which it has provided a new approach to building significant repositories of historical data. Heurist handles more than 60 types of digital entity (easily extensible), ranging from bibliographic references and internet bookmarks, through encyclopaedia entries, seminars and grant programs, to C14 dates, archaeological sites and spatial databases. It allows users to attach multimedia resources and annotations to each entity in the database, using private, public, and group-restricted wiki entries. Some entries can be locked off as authoritative content, others can be left open to all comers. Effective geographic and temporal contextualisation and linking between entities provides new opportunities for Humanities research, particularly in History and Archaeology.
Heurist allows the user to digitize and attach geographic data to any entity type, to attach photographs and other media to entities, and to store annotated, date-stamped relationships between entities. These are the key to linking bibliographic entries to other types of entity and building, browsing and visualizing networks of related entities. Heurist represents a first step towards building a single point of entry Virtual Research Environment for the Humanities. It already provides 'instant' web services, such as mapping, timelines, styled output through XSLT and various XML feeds (XML, KML, RSS), allowing it to serve as one component in a decentralized system. The next version will operate in a peer-to-peer network of instances which can share data with one another and with other applications. The service at HeuristScholar.org is freely available for academic use and has been used to construct projects as varied as the University of Sydney Archaeology department website, content management for the Dictionary of Sydney project (a major project to develop an online historical account of the history of Sydney) and an historical event browser for the Rethinking Timelines project.

Constructing Social Networks in Modern Ireland (c.1750-c.1940) Using ACQ

Jennifer Kelly
jennifer.kelly@nuim.ie
National University of Ireland, Maynooth, Ireland

John G. Keating
john.keating@nuim.ie
National University of Ireland, Maynooth, Ireland

Introduction

The Associational Culture in Ireland (ACI) project at NUI Maynooth explores the culture of Irish associational life from 1750 to 1940, not merely from the point of view of who, what, where and when, but also to examine the 'hidden culture' of social networking that operated behind many clubs and societies throughout the period. Recently commissioned government research on civic engagement and active citizenship in Ireland has highlighted the paucity of data available for establishing 'trends in volunteering, civic participation, voting and social contact in Ireland' (Taskforce on Active Citizenship, Background Working Paper, 2007, p. 2). The same research has also confirmed the importance in Ireland of informal social networking compared to many other economically developed countries (Report of the Taskforce on Active Citizenship, 2007). The objective of the ACI project is to provide a resource to enable scholars of Irish social and political life to reconstruct and highlight the role that the wider informal community information field played in the public sphere in Ireland from the mid-eighteenth century. The project will also provide long-term quantitative digital data on associational culture in Ireland which is compatible with sophisticated statistical analysis, thereby enabling researchers to overcome one of the hindrances of modern-day purpose-based social surveys: the short timeframe of data currently available.

Associational Culture and Social Networking

All historians are aware of the importance of social networks that underpin the foundations of social and political life in the modern world (Clark, 2000; Putnam, 2000). However, given the often-transient nature of much of these networks, they can be quite difficult to reconstruct. One way to examine social networks is to trace them through the structures of a developing associational culture where the cultivation of social exclusivity and overlapping membership patterns provide insights into the wider organisation of civil society at local, regional and national levels. To this end, the ACI project mines a wide range of historical sources to piece together as comprehensive a view as possible of the various voluntary formal associations that existed in Ireland during the period c.1750-c.1940.
The first phase of the project concentrated on collecting data on Irish associational culture and social networking in the period 1750-1820; the current phase centres on the period 1820-1880, and the third phase will focus on the years between 1880 and 1940. The research results so far testify to the vibrancy of associational activity in Ireland, and patterns in social networking are already becoming apparent in different parts of the country. The results indicate that particular forms of associational culture were popular in different parts of the country from the mid-eighteenth century, which in turn produced their own particular social networking systems. In many respects these patterns were maintained into the nineteenth century, with similar continuities in patterns of sociability even though these continuities were sometimes expressed through different organisations: for example, the ubiquitous Volunteering movement of the later eighteenth century gave way to local Yeomanry units at the beginning of the nineteenth century. Associational culture among the urban middling strata also appeared to increase somewhat moving into the nineteenth century, with the increasing appearance of charitable and temperance societies in different parts of the country.

Software Development

Given the vast range of data available in the sources and the multiplicity of question types posed by the findings, the organisation and dissemination of the research was one of the main priorities for the project. The desire to present a quantitative as well as qualitative profile of Irish associational culture, which can be expanded upon by future research, presented particular difficulties in terms of the production of the project’s findings. A multi-skilled, partnership approach was adopted, fusing historical and computer science expertise to digitally represent the data recovered from the sources and expose the resultant social networking patterns through the construction of an appropriate online database for the project. Encompassing over seventy individual data fields for each association, the online ACI database is fully searchable by organisation and by individual – the latter feature in particular allowing social patterns behind Irish associational culture to be traced over the course of two centuries.

Bradley (2005) points out that XML is well suited to document-oriented project materials (such as written text), whereas relational database implementations are better suited to data-oriented project materials. Furthermore, he indicates that projects may often use both, given that textual materials often contain data materials that are more suited to relational models. He argues that even if one begins with a text, the materials may be data-oriented and better served by a relational model where certain characteristics, for example linkages between data objects and powerful querying, are more easily expressed using SQL (Structured Query Language) than XML searching facilities such as XPath or XQuery. It was necessary for our project to use a relational model to encode data-oriented materials, even though the data were derived from newspaper articles, which are typically more suited to encoding in XML.
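The abstract does not reproduce the relational model itself; the sketch below is a hypothetical, drastically simplified fragment suggested only by the table and column names that appear in the worked SQL example further on (Body, BodyName, BodyGender, BodyStatus), with an invented sample record. The production ACI database described below contains over one hundred tables.

import sqlite3

# Hypothetical, simplified fragment of an ACI-style relational model,
# limited to the four tables used in the SQL example below.
schema = """
CREATE TABLE Body (
    idBody       INTEGER PRIMARY KEY,
    foundingDate TEXT                 -- ISO date, e.g. '1840-02-14'
);
CREATE TABLE BodyName (
    Body_idBody INTEGER REFERENCES Body(idBody),
    bodyName    TEXT
);
CREATE TABLE BodyGender (
    Body_idBody INTEGER REFERENCES Body(idBody),
    gender      TEXT                  -- illustrative values, e.g. 'Female'
);
CREATE TABLE BodyStatus (
    Body_idBody  INTEGER REFERENCES Body(idBody),
    typeOfStatus TEXT                 -- illustrative values, e.g. 'Illegal'
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)

# One invented association record, spread across the linked tables.
conn.execute("INSERT INTO Body VALUES (1, '1840-02-14')")
conn.execute("INSERT INTO BodyName VALUES (1, 'Example Reading Society')")
conn.execute("INSERT INTO BodyGender VALUES (1, 'Female')")
conn.execute("INSERT INTO BodyStatus VALUES (1, 'Illegal')")
conn.commit()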
Our relational objects are densely linked, and it is often necessary to build SQL queries incorporating multiple joins across many database tables. The database contains over 100 individual database tables related to associations, members, sources, relationships, associate members, locations, etc. Authenticated data entry and access is available using online forms. Test, rollback and moderation facilities are available for different classes of user. Research queries may be formulated using a specifically designed English-language-type formal grammar called the Associational Culture Query Language (ACQL), which is parsed by the software system to extract and present information from the database – parsing involves evaluating ACQL and converting it into appropriately constructed SQL suitable for querying the database. Users may construct ACQL manually or they may use an online query builder (see Figure 1).

The software engineering process began with all partners collaborating on the construction of a Specification of Requirements (SR) document, which essentially formed the contract between the ACI historians and the software engineers – this key document was used to drive all phases of the software development process, for example System Analysis, System Design Specifications, Software Specifications, and Testing and Maintenance. These phases were implemented using a rapid prototyping model, where successive passes through a design-development-testing cycle provided new prototypes which could be evaluated against the SR document. As is typical of such development projects, there was some requirement drift, but the rapid prototyping approach ensured insignificant divergence between prototypes and the expectations outlined in the SR document.

In order to produce the SR document, the ACI historians composed a series of research questions, which were used to extract the information categories necessary for the construction of social networks. During the collaborative process, these research questions and sources were used to identify and model the relationships between the associations and individuals in the sources. A key requirement was that ACI historians wanted to be able to answer specific research questions related to social networking, for example, “Are there any illegal women’s associations/clubs active in Ireland after 1832?” It was necessary to revise these research questions during the Analysis phase, however, as it became apparent that the historians did not expect a yes/no answer to this question, but rather a list of the associations and clubs, if they existed. Most research questions therefore went through a revision process so that the cardinality or result context was made explicit, i.e. “List the illegal women’s associations active in Ireland after 1832”. This research question could be reformulated in ACQL as follows:

LIST THE BODIES (NAME)
(
  WHERE THE GENDER IS “Female”
  AND WHERE THE STATUS IS “Illegal”
  AND WHERE THE FOUNDING DATE IS GREATER THAN 1st OF January 1832
)

The parsing software then converts this ACQL query into the following SQL:

SELECT BodyName.bodyName
FROM Body, BodyGender, BodyName, BodyStatus
WHERE (Body.idBody = BodyName.Body_idBody)
  AND (Body.idBody = BodyGender.Body_idBody
    AND BodyGender.gender = ‘Female’
    AND (Body.idBody = BodyStatus.Body_idBody
      AND BodyStatus.typeOfStatus = ‘Illegal’
      AND (Body.foundingDate > ‘1832-01-1’)));
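The ACQL parser itself is not described in the abstract; purely to make the idea of mapping a controlled English-like query onto SQL concrete, the following hypothetical sketch handles only the single query pattern shown above (with an ISO date instead of ACQL's natural-language date) and uses regular expressions rather than a formal grammar. It emits a flat conjunction rather than the nested parentheses shown above.

import re

# Hypothetical, minimal illustration of an ACQL-to-SQL mapping for queries of
# the form "LIST THE BODIES (NAME) ( WHERE ... AND WHERE ... )".  The real
# ACQL parser is grammar-based and covers far more of the language.
CONDITION_MAP = {
    "GENDER IS": lambda v: f"BodyGender.gender = '{v}'",
    "STATUS IS": lambda v: f"BodyStatus.typeOfStatus = '{v}'",
    "FOUNDING DATE IS GREATER THAN": lambda v: f"Body.foundingDate > '{v}'",
}

def acql_to_sql(acql: str) -> str:
    # Pull out the individual WHERE clauses between the outer parentheses.
    body = acql.split("(", 2)[2].rsplit(")", 1)[0]
    clauses = [c.strip() for c in re.split(r"\bAND\b", body) if c.strip()]
    conditions = [
        "Body.idBody = BodyName.Body_idBody",
        "Body.idBody = BodyGender.Body_idBody",
        "Body.idBody = BodyStatus.Body_idBody",
    ]
    for clause in clauses:
        clause = re.sub(r"^WHERE THE\s+", "", clause)
        for phrase, render in CONDITION_MAP.items():
            if clause.startswith(phrase):
                value = clause[len(phrase):].strip().strip('"“”')
                conditions.append(render(value))
    return ("SELECT BodyName.bodyName "
            "FROM Body, BodyGender, BodyName, BodyStatus "
            "WHERE " + " AND ".join(conditions) + ";")

query = ('LIST THE BODIES (NAME) ( WHERE THE GENDER IS "Female" '
         'AND WHERE THE STATUS IS "Illegal" '
         'AND WHERE THE FOUNDING DATE IS GREATER THAN 1832-01-01 )')
print(acql_to_sql(query))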
This query is then executed by the Relational Database Management System (RDBMS) and the results are returned to the environment for presentation to the user. We examined over one hundred questions of this type, ensuring that our relational model provided appropriate answers. We developed ACQL to map research questions into SQL queries, thereby removing the requirement for the user to have knowledge of the database, table joins, or even SQL. Researchers need not have an intimate knowledge of the relationship between the database tables and their associated fields to perform searches. Another advantage of this approach is that the underlying database structure could change and the users would not have to change the format of their queries implemented in ACQL.

Ongoing and Future Developments

The success of this project depends on the research community using and contributing to the information contained in the online database. We are heartened, however, by the reports of Warwick et al. (2007), who have shown that when information aggregation sites have user-friendly interfaces, contain quality peer-reviewed information, and fit research needs, scholars are more likely to adopt the digital resource. We believe that ACQL is a crucial component in making this system usable by professional historians interested in social networking. In particular, as the volume of data within the database expands, it will become possible to develop further social analysis tools such as sociograms. All in all, the ACI database, by providing quantitative and qualitative data on specific associations, regions, and groups of individuals, comparable with international data sources, will greatly aid historical and social science researchers in establishing Irish trends in civic participation, social inclusion, marginalisation and grassroots organising in modern Ireland.

References

Bradley, J. (2005) Documents and Data: Modelling Materials for Humanities Research in XML and Relational Databases. Literary and Linguistic Computing, Vol. 20, No. 1.

Clark, P. (2000) British clubs and societies, 1500-1800: the origins of an associational world. Oxford: Oxford University Press.

Putnam, R. (2000) Bowling alone: the collapse and revival of American community. New York: Simon & Schuster.

Taskforce on Active Citizenship (2007) Statistical Evidence on Active Citizenship in Ireland, Background Working Paper. Retrieved March 25, 2008 from the World Wide Web: http://www.activecitizen.ie/UPLOADEDFILES/Mar07/Statistical%20Report%20(Mar%2007).pdf.

Taskforce on Active Citizenship (2007) Report of the Taskforce on Active Citizenship. Dublin: Secretariat of the Taskforce on Active Citizenship.

Warwick, C., Terras, M., Huntington, P. and Pappa, N. (2007). If You Build It Will They Come? The LAIRAH Study: Quantifying the Use of Online Resources in the Arts and Humanities through Statistical Analysis of User Log Data. Literary and Linguistic Computing, Vol. 22, No. 1, pp. 1-18.
Unnatural Language Processing: Neural Networks and the Linguistics of Speech

William Kretzschmar kretzsch@uga.edu University of Georgia, USA

The foundations of the linguistics of speech (i.e., language in use, what people actually say and write to and for each other), as distinguished from “the linguistics of linguistic structure” that characterizes many modern academic ideas about language, are 1) the continuum of linguistic behavior, 2) extensive (really massive) variation in all features at all times, 3) the importance of regional/social proximity to “shared” linguistic production, and 4) differential frequency as a key factor in linguistic production, both in regional/social groups and in collocations in text corpora (all points easily and regularly established with empirical study using surveys and corpora, as shown in Kretzschmar Forthcoming a). Taken together, the basic elements of speech correspond to what has been called a “complex system” in sciences ranging from physics to ecology to economics. Order emerges from such systems by means of self-organization, but the order that arises from speech is not the same as what linguists study under the rubric of linguistic structure. This paper will explore the relationship between the results of computational analysis of language data with neural network algorithms, traditionally accepted dialect areas and groupings, and order as it emerges from speech interactions.

In both texts and regional/social groups, the frequency distribution of features (language variants per se or in proximal combinations such as collocations and colligations) occurs as the same curve: a “power law” or asymptotic hyperbolic curve (in my publications, aka the “A-curve”). Speakers perceive what is “normal” or “different” for regional/social groups and for text types according to the A-curve: the most frequent variants are perceived as “normal,” less frequent variants are perceived as “different,” and since particular variants are more or less frequent among different groups of people or types of discourse, the variants come to mark the identity of the groups or types by means of these perceptions. Particular variants also become more or less frequent in historical terms, which accounts for what we call “linguistic change,” although of course any such “changes” are dependent on the populations or text types observed over time (Kretzschmar and Tamasi 2003). In both synchronic and diachronic study the notion of “scale” (how big are the groups we observe, from local to regional/social to national) is necessary to manage our observations of frequency distributions. Finally, our perceptions of the whole range of “normal” variants (at any level of scale) create “observational artifacts.” That is, the notion of the existence of any language or dialect is actually an “observational artifact” that comes from our perceptions of the available variants (plus other information and attitudes), at one point in time and for a particular group of speakers, as mediated by the A-curve.
The notion “Standard,” as distinct from “normal,” represents institutional agreement about which variants to prefer, some less frequent than the “normal” variants for many groups of speakers, and this creates the appearance of parallel systems for “normal” and “Standard.” The best contemporary model that accommodates such processing is connectionism, parallel processing according to what anthropologists call “schemas” (i.e., George Mandler’s notion of schemas as a processing mechanism, D’Andrade 1995: 122-126, 144-145). Schemas are not composed of a particular set of characteristics to be recognized (an object), but instead of an array of slots for characteristics out of which a pattern is generated, and so schemas must include a process for deciding what to construct. One description of such a process is the serial symbolic processing model (D’Andrade 1995: 136-138), in which a set of logical rules is applied in sequence to information available from the outside world in order to select a pattern. A refinement of this model is the parallel distributed processing network, also called the connectionist network, or neural net (D’Andrade 1995: 138-141), which allows parallel operation by a larger set of logical rules. The logical rules are Boolean operators, whose operations can be observed, for example, in simulations that Kauffman (1996) has built based on networks of lightbulbs. Given a very large network of neurons that either fire or not, depending upon external stimuli of different kinds, binary Boolean logic is appropriate to model “decisions” in the brain which arise from the on/off firing patterns. Kauffman’s simulations were created to model chemical and biological reactions which are similarly binary, either happening or not happening given their state (or pattern) of activation, as the system cycles through its possibilities. The comparison yields similar results: as D’Andrade reports (1995: 139-140), serial processing can be “‘brittle’ – if the input is altered very slightly or the task is changed somewhat, the whole program is likely to crash” (or as Kauffman might say, likely to enter a chaotic state cycle), while parallel processing appears to be much more flexible given mixed or incomplete input or a disturbance to the system (or as Kauffman might say, it can achieve homeostatic order).

Computational modeling of neural networks appears, then, to be an excellent match for analysis of language data. Unfortunately, results to date have often been disappointing when applied to geographic language variation (Nerbonne and Heeringa 2001, Kretzschmar 2006). Neural network analysis cannot be shown reliably to replicate traditional dialect patterns. Instead, self-organizational patterns yielded by neural net algorithms appear to respond only in a general way to assumed dialect areas, and often appear to be derived not from the data but from conditions of its acquisition, such as “field worker” effects (Kretzschmar Forthcoming b). However, this paper will show, using results from experiments with an implementation of a Self-Organizing Map (SOM) algorithm (Thill, Kretzschmar, Casas, and Yao Forthcoming), that application of the model from the linguistics of speech to computer neural network analysis of geographical language data can explain such anomalies.
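The abstract does not describe the SOM implementation used in these experiments; purely as an orientation for readers unfamiliar with the technique, the following sketch trains a toy self-organizing map on invented feature vectors. It is not the implementation of Thill, Kretzschmar, Casas, and Yao, and the data are made up.

import numpy as np

# Toy self-organizing map: a 6x6 grid of nodes trained on invented
# "informant" feature vectors (e.g. frequencies of dialect variants).
rng = np.random.default_rng(0)
grid_w, grid_h, n_features = 6, 6, 10
weights = rng.random((grid_w, grid_h, n_features))
data = rng.random((200, n_features))          # 200 invented informants

coords = np.array([[x, y] for x in range(grid_w) for y in range(grid_h)])
coords = coords.reshape(grid_w, grid_h, 2)

n_iter, lr0, radius0 = 2000, 0.5, 3.0
for t in range(n_iter):
    lr = lr0 * np.exp(-t / n_iter)            # decaying learning rate
    radius = radius0 * np.exp(-t / n_iter)    # shrinking neighbourhood
    v = data[rng.integers(len(data))]
    # Best-matching unit: node whose weight vector is closest to v.
    dists = np.linalg.norm(weights - v, axis=2)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Pull the BMU and its grid neighbours towards the input vector.
    grid_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
    influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
    weights += lr * influence[..., None] * (v - weights)

# Each informant can now be assigned to its best-matching node, and the
# resulting clusters inspected against assumed dialect areas.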
It is not the implementation of neural nets that is the problem, but instead lack of control over the scale of analysis, and of the non-linear distribution of the variants included in the analysis, that tends to cause the problems we observe. In the end, we still cannot validate traditional dialect areas from the data (because these areas were also derived without sufficient control over the dynamics of the speech model), but we can begin to understand more clearly how the results of neural network analysis do reveal important information about the distribution of the data submitted to them.

References

D’Andrade, Roy. 1995. The Development of Cognitive Anthropology. Cambridge: Cambridge University Press.

Kauffman, Stuart. 1996. At Home in the Universe: The Search for the Laws of Self-Organization and Complexity. New York: Oxford University Press.

Kretzschmar, William A., Jr. 2006. Art and Science in Computational Dialectology. Literary and Linguistic Computing 21: 399-410.

Kretzschmar, William A., Jr. Forthcoming a. The Linguistics of Speech. Cambridge: Cambridge University Press.

Kretzschmar, William A., Jr. Forthcoming b. The Beholder’s Eye: Using Self-Organizing Maps to Understand American Dialects. In Anne Curzan and Michael Adams, eds., Contours of English (Ann Arbor: University of Michigan Press).

Kretzschmar, William A., Jr., and Susan Tamasi. 2003. Distributional Foundations for a Theory of Language Change. World Englishes 22: 377-401.

Nerbonne, John, and Wilbert Heeringa. 2001. Computational Comparison and Classification of Dialects. Dialectologia et Geolinguistica 9: 69-83.

Thill, J., W. Kretzschmar, Jr., I. Casas, and X. Yao. Forthcoming. Detecting Geographic Associations in English Dialect Features in North America with Self-Organising Maps. In Self-Organising Maps: Applications in GI Science, edited by P. Agarwal and A. Skupin (London: Wiley).

Digital Humanities ‘Readership’ and the Public Knowledge Project

Caroline Leitch cmleitch@uvic.ca University of Victoria, Canada
Ray Siemens siemens@uvic.ca University of Victoria, Canada
Analisa Blake University of Victoria, Canada
Karin Armstrong karindar@uvic.ca University of Victoria, Canada
John Willinsky john.willinsky@ubc.ca University of British Columbia, Canada

As the amount of scholarly material published in digital form increases, there is growing pressure on content producers to identify the needs of expert readers and to create online tools that satisfy their requirements. Based on the results of a study conducted by the Public Knowledge Project and introduced at Digital Humanities 2006 (Siemens, Willinsky and Blake), continued and augmented since, this paper discusses the reactions of Humanities Computing scholars and graduate students to using a set of online reading tools. Expert readers were asked about the value of using online tools that allowed them to check, readily, related studies that were not cited in the article; to examine work that has followed on the study reported in the article; to consider additional work that has been done by the same author; and to consult additional sources on the topic outside of the academic literature.

In the course of this study, these domain-expert readers made it clear that reading tools could make a definite, if limited, contribution to their critical engagement with journal articles (especially if certain improvements were made). Their reactions also point to an existing set of sophisticated reading strategies common to most expert readers. Our findings indicate that online tools are of most value to expert readers when they complement and augment readers’ existing strategies.

We have organized the results of the study around a number of themes that emerged during our interviews with domain-expert readers, as these themes speak to both readers’ existing reading processes and the potential value of the online reading tools. By entering user responses into a matrix, we have been able to measure user responses and track both negative and positive reactions to different aspects of the online reading tools.

In addition to these findings, we also discovered that users’ experiences with the online reading tools were influenced by their existing research methods, their familiarity with online research, and their expectations of online publishing. While many respondents felt that the “information environment” created by the online tools was beneficial to their evaluation and understanding of the material, they also expressed some dissatisfaction with their experience. Some users questioned the relevance and usefulness of the contextual material retrieved by the online tools. Users were also concerned with the perceived credibility of research published online and the limited amount of freely available online material.

The results of our study reveal both the potential strengths and perceived weaknesses of online reading environments. Understanding how users read and evaluate research materials, anticipating users’ expectations of the reading tools and resources, and addressing user concerns about the availability of online material will lead to improvements in the design and features of online publishing.

Bibliography

Afflerbach, Peter P. “The Influence of Prior Knowledge on Expert Readers’ Main Idea Construction Strategies.” Reading Research Quarterly 25.1 (1990): 31-46.

Alexander, P. A. Expertise and Academic Development: A New Perspective on a Classic Theme. Invited keynote address to the Biennial meeting of the European Association for the Research on Learning and Instruction (EARLI). Padova, Italy. August 2003.

Chen, S., F. Jing-Ping, and R. Macredie. “Navigation in Hypermedia Learning Systems: Experts vs. Novices.” Computers in Human Behavior 22.2 (2006): 251-266.

Pressley, M., and P. Afflerbach. Verbal Protocols of Reading: The Nature of Constructively Responsive Reading. Hillsdale, NJ: Erlbaum, 1995.

Schreibman, Susan, Raymond George Siemens, and John M. Unsworth, eds. A Companion to Digital Humanities. Oxford, UK: Blackwell, 2004.

Siemens, Ray, Analisa Blake, and John Willinsky. “Giving Them a Reason to Read Online: Reading Tools for Humanities Scholars.” Digital Humanities 2006. Paris.

Stanitsek, Georg. “Texts and Paratexts in Media.” Trans. Ellen Klien. Critical Inquiry 32.1 (2005): 27-42.

Willinsky, J. “Policymakers’ Use of Online Academic Research.” Education Policy Analysis Archives 11.2, 2003, <http://epaa.asu.edu/epaa/v11n2/>.

Willinsky, J., and M. Quint-Rapoport. When Complementary and Alternative Medicine Practitioners Use PubMed. Unpublished paper. University of British Columbia. 2007.

Windschuttle, Keith.
“Edward Gibbon and the Enlightenment.” The New Criterion 15.10. June 1997. <http:// www.newcriterion.com/archive/15/jun97/gibbon.htm>. Wineburg, S. “Historical Problem Solving: A Study of the Cognitive Processes Used in the Evaluation of Documentary and Pictorial Evidence.” Journal of Educational Psychology 83.1 (1991): 73-87. Wineburg, S. “Reading Abraham Lincoln: An Expert/Expert Study in the Interpretation of Historical Texts.” Cognitive Science 22.3 (1998): 319-346. Wyatt, D., M. Pressley, P.B. El-Dinary, S. Stein, and R. Brown. “Comprehension Strategies, Worth and Credibility Monitoring, and Evaluations: Cold and Hot Cognition When Experts Read Professional Articles That Are Important to Them.” Learning and Individual Differences 5 (1993): 49–72. Using syntactic features to predict author personality from text Kim Luyckx kim.luyckx@ua.ac.be University of Antwerp, Belgium Walter Daelemans walter.daelemans@ua.ac.be University of Antwerp, Belgium Introduction The style in which a text is written reflects an array of meta-information concerning the text (e.g., topic, register, genre) and its author (e.g., gender, region, age, personality). The field of stylometry addresses these aspects of style. A successful methodology, borrowed from text categorisation research, takes a two-stage approach which (i) achieves automatic selection of features with high predictive value for the categories to be learned, and (ii) uses machine learning algorithms to learn to categorize new documents by using the selected features (Sebastiani, 2002). To allow the selection of linguistic features rather than (n-grams of) terms, robust and accurate text analysis tools are necessary. Recently, language technology has progressed to a state of the art in which the systematic study of the variation of these linguistic properties in texts by different authors, time periods, regiolects, genres, registers, or even genders has become feasible. This paper addresses a not yet very well researched aspect of style, the author’s personality. Our aim is to test whether personality traits are reflected in writing style. Descriptive statistics studies in language psychology show a direct correlation: personality is projected linguistically and can be perceived through language (e.g., Gill, 2003; Gill & Oberlander, 2002; Campbell & Pennebaker, 2003). The focus is on extraversion and neuroticism, two of “the most salient and visible personality traits” (Gill, 2003, p. 13). Research in personality prediction (e.g., Argamon et al., 2005; Nowson & Oberlander, 2007; Mairesse et al., 2007) focuses on openness, conscientiousness, extraversion, agreeableness, and neuroticism. We want to test whether we can automatically predict personality in text by studying the four components of the Myers-Briggs Type Indicator: Introverted-Extraverted, Intuitive-Sensing, Thinking-Feeling, and Judging-Perceiving. We introduce a new corpus, the Personae corpus, which consists of Dutch written language, while other studies focus on English. Nevertheless, we believe our techniques to be transferable to other languages. _____________________________________________________________________________ 146 Digital Humanities 2008 _____________________________________________________________________________ Related Research in Personality Prediction Most of the research in personality prediction involves the Five-Factor Model of Personality: openness, conscientiousness, extraversion, agreeableness, and neuroticism. 
The so-called Big Five have been criticized for their limited scope, methodology and the absence of an underlying theory. Argamon et al. (2005) predict personality in student essays using functional lexical features. These features represent lexical and structural choices made in the text. Nowson & Oberlander (2007) perform feature selection and training on a small and clean weblog corpus, and test on a large, automatically selected corpus. Features include n-grams of words with predictive strength for the binary classification tasks. Openness is excluded from the experiments because of the skewed class distribution. While the two studies mentioned above took a bottom-up approach, Mairesse et al. (2007) approach personality prediction from a top-down perspective. On a written text corpus, they test the predictive strength of linguistic features that have been proposed in descriptive statistics studies.

Corpus Construction

Our 200,000-word Personae corpus consists of 145 BA student essays of about 1,400 words each, written about a documentary on Artificial Life in order to keep genre, register, topic and age relatively constant. These essays contain a factual description of the documentary and the students’ opinion about it. The task was voluntary and students producing an essay were rewarded with two cinema tickets. They took an online MBTI test and submitted their profile, the text and some user information. All students released the copyright of their text to the University of Antwerp and explicitly allowed the use of their text and personality profile for research, which makes it possible to distribute the corpus.

The Myers-Briggs Type Indicator (Myers & Myers, 1980) is a forced-choice test based on Jung’s personality typology which categorizes a person on four preferences:

• Introversion and Extraversion (attitudes): I’s tend to reflect before they act, while E’s act before they reflect.
• iNtuition and Sensing (information-gathering): N’s rely on abstract or theoretical information, while S’s trust information that is concrete.
• Feeling and Thinking (decision-making): While F’s decide based on emotions, T’s involve logic and reason in their decisions.
• Judging and Perceiving (lifestyle): J’s prefer structure in their lives, while P’s like change.

MBTI correlates with the Big Five personality traits of extraversion and openness, to a lesser extent with agreeableness and conscientiousness, but not with neuroticism (McCrae & Costa, 1989). The participants’ characteristics are too homogeneous for experiments concerning gender, mother tongue or region, but we find interesting distributions in at least two of the four MBTI preferences: .45 I vs. .55 E, .54 N vs. .46 S, .72 F vs. .28 T, and .81 J vs. .19 P. Personality measurement in general – and the MBTI is no exception – is a controversial domain. However, especially for scores on the IE and NS dimensions, the consensus is that they are correlated with personality traits. In the remainder of this paper, we will provide results on the prediction of personality types from features extracted from the linguistically analyzed essays.

Feature Extraction

While most stylometric studies are based on token-level features (e.g., word length), word forms and their frequencies of occurrence, syntactic features have been proposed as more reliable style markers since they are not under the conscious control of the author (Stamatatos et al., 2001).
We use Memory-Based Shallow Parsing (MBSP) (Daelemans et al., 1999), which gives an incomplete parse of the input text, to extract reliable syntactic features. MBSP tokenizes, performs a part-of-speech analysis, looks for chunks (e.g., noun phrase) and detects the subject and object of the sentence and some other grammatical relations. Features occurring more often than expected (based on the chi-square metric) in either of the two classes are extracted automatically for every document. Lexical features (lex) are represented binary or numerically, in n-grams. N-grams of both fine-grained (pos) and coarse-grained parts-of-speech (cgp) are integrated in the feature vectors. These features have been proven useful in stylometry (cf. Stamatatos et al., 2001) and are now tested for personality prediction.

Experiments in Personality Prediction and Discussion

We report on experiments on eight binary classification tasks (e.g., I vs. not-I) (cf. Table 1) and four tasks in which the goal is to distinguish between the two poles in the preferences (e.g., I vs. E) (cf. Table 2). Results are based on ten-fold cross-validation experiments with TiMBL (Daelemans & van den Bosch, 2005), an implementation of memory-based learning (MBL). MBL stores feature representations of training instances in memory without abstraction and classifies new instances by matching their feature representation to all instances in memory. We also report random and majority baseline results. Per training document, a feature vector is constructed, containing comma-separated binary or numeric features and a class label. During training, TiMBL builds a model based on the training data by means of which the unseen test instances can be classified.

Task         Feature set   Precision  Recall    F-score  Accuracy
Introverted  lex 3-grams   56.70%     84.62%    67.90%   64.14%
             random        44.1%      46.2%
Extraverted  cgp 3-grams   58.09%     98.75%    73.15%   60.00%
             random        54.6%      52.5%
iNtuitive    cgp 3-grams   56.92%     94.87%    71.15%   58.62%
             random        48.7%      48.7%
Sensing      pos 3-grams   50.81%     94.03%    65.97%   55.17%
             random        40.3%      40.3%
Feeling      lex 3-grams   73.76%     99.05%    84.55%   73.79%
             random        72.6%      73.3%
Thinking     lex 1-grams   40.00%     50.00%    44.44%   65.52%
             random        28.2%      27.5%
Judging      lex 3-grams   81.82%     100.00%   90.00%   82.07%
             random        77.6%      76.9%
Perceiving   lex 2-grams   26.76%     67.86%    38.38%   57.93%
             random        6.9%       7.1%

Table 1: TiMBL results for eight binary classification tasks

Table 1 suggests that tasks for which the class distributions are not skewed (I, E, N and S) achieve F-scores between 64.1% and 73.2%. As expected, results for Feeling and Judging are high, but the features and methodology still allow for a score around 40% for tasks with little training data. For the I-E task – correlated to extraversion in the Big Five – we achieve an accuracy of 65.5%, which is better than Argamon et al. (2005) (57%), Nowson & Oberlander (2007) (51%), and Mairesse et al. (2007) (55%). For the N-S task – correlated to openness – we achieve the same result as Mairesse et al. (2007) (62%). For the F-T and J-P tasks, the results hardly achieve higher than the majority baseline, but nevertheless something is learned for the minority class, which indicates that the features selected work for personality prediction, even with heavily skewed class distributions.
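The experiments themselves use MBSP and TiMBL; the following sketch is only a generic, hypothetical illustration of the same two-stage approach – chi-square selection of n-gram features followed by a nearest-neighbour (memory-based) classifier – built with scikit-learn and a few invented toy documents rather than the authors' tools and corpus.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Invented toy data: tiny stand-ins for essays, labelled I or E.
essays = ["ik denk dat ...", "wij gaan samen ...",
          "ik blijf liever thuis ...", "dat feest was geweldig ..."]
labels = ["I", "E", "I", "E"]

pipeline = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(3, 3)),  # character 3-gram counts
    SelectKBest(chi2, k=20),              # keep features that differ most by class
    KNeighborsClassifier(n_neighbors=1),  # memory-based: match new texts to stored instances
)
pipeline.fit(essays, labels)
print(pipeline.predict(["ik twijfel nog ..."]))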
Task      Feature set   F-score [INFJ]  F-score [ESTP]  Average F-score  Accuracy
I vs. E   lex 3-grams   67.53%          63.24%          65.38%           65.52%
          random                                                         49.7%
          majority                                                       55.2%
N vs. S   pos 3-grams   58.65%          64.97%          61.81%           62.07%
          random                                                         44.8%
          majority                                                       53.8%
F vs. T   lex 3-grams   84.55%          13.64%          49.09%           73.79%
          random                                                         60.7%
          majority                                                       72.4%
J vs. P   lex 3-grams   90.00%          13.33%          51.67%           82.07%
          random                                                         63.5%
          majority                                                       80.7%

Table 2: TiMBL results for four discrimination tasks

Table 2 shows results on the four discrimination tasks, which allows us to compare with results from other studies in personality prediction. Argamon et al. (2005) find appraisal adjectives and modifiers to be reliable markers (58% accuracy) of neuroticism, while extraversion can be predicted by function words with 57% accuracy. Nowson & Oberlander (2007) predict high/low extraversion with a 50.6% accuracy, while the system achieves 55.8% accuracy on neuroticism, 52.9% on agreeableness, and 56.6% on conscientiousness. Openness is excluded because of the skewed class distribution. Taking a top-down approach, Mairesse et al. (2007) report accuracies of 55.0% for extraversion, 55.3% for conscientiousness, 55.8% for agreeableness, 57.4% for neuroticism, and 62.1% for openness.

Conclusions and Future Work

Experiments with TiMBL suggest that the first two personality dimensions (Introverted-Extraverted and iNtuitive-Sensing) can be predicted fairly accurately. We also achieve good results in six of the eight binary classification tasks. Thanks to improvements in shallow text analysis, we can use syntactic features for the prediction of personality type and author. Further research using the Personae corpus will involve a study of stylistic variation between the 145 authors. A lot of the research in author recognition is performed on a closed-class task, which is an artificial situation. Hardly any corpora – except for some based on blogs (Koppel et al., 2006) – have more than ten candidate authors. The corpus allows the computation of the degree of variability of different (types of) features encountered in text on a single topic when taking into account a relatively large set of authors. This will be a useful complementary resource in a field dominated by studies potentially overestimating the importance of these features in experiments discriminating between only two or a small number of authors.

Acknowledgements

This study has been carried out in the framework of the Stylometry project at the University of Antwerp. The “Computational Techniques for Stylometry for Dutch” project is funded by the National Fund for Scientific Research (FWO) in Belgium.

References

Argamon, S., Dhawle, S., Koppel, M. and Pennebaker, J. (2005), Lexical predictors of personality type, Proceedings of the Joint Annual Meeting of the Interface and the Classification Society of North America.

Campbell, R. and Pennebaker, J. (2003), The secret life of pronouns: Flexibility in writing style and physical health, Psychological Science 14, 60-65.

Daelemans, W. and van den Bosch, A. (2005), Memory-Based Language Processing, Studies in Natural Language Processing, Cambridge, UK: Cambridge University Press.

Daelemans, W., Bucholz, S. and Veenstra, J. (1999), Memory-Based Shallow Parsing, Proceedings of the 3rd Conference on Computational Natural Language Learning CoNLL, pp. 53-60.

Gill, A.
(2003), Personality and language: The projection and perception of personality in computer-mediated communication, PhD thesis, University of Edinburgh. Gill, A. & Oberlander J. (2002), Taking care of the linguistic features of extraversion, Proceedings of the 24th Annual Conference of the Cognitive Science Society, pp. 363-368. Koppel, M., Schler, J., Argamon, S. and Messeri, E. (2006), Authorship attribution with thousands of candidate authors, Proceedings of the 29th ACM SIGIR Conference on Research and Development on Information Retrieval, pp. 659-660. Mairesse, F., Walker, M., Mehl, M. and Moore, R. (2007), Using linguistic cues for the automatic recognition of personality in conversation and text, Journal of Artificial Intelligence Research. McCrae, R. and Costa, P. (1989), Reinterpreting the MyersBriggs Type Indicator from the perspective of the Five-Factor Model of Personality, Journal of Personality 57(1), 17-40. Myers, I. and Myers, P. (1980), Gifts differing: Understanding personality type, Mountain View, CA: Davies-Black Publishing. Nowson, S. and Oberlander, J. (2007), Identifying more bloggers. Towards large scale personality classification of personal weblogs, Proceedings of International Conference on Weblogs and Social Media ICWSM. Sebastiani, F. (2002), Machine learning in automated text categorization, ACM Computing Surveys 34(1), 1-47. Stamatatos, E., Fakotakis, N. and Kokkinakis, G. (2001), Computer-based authorship attribution without lexical measures, Computers and the Humanities 35(2), 193-214. An Interdisciplinary Perspective on Building Learning Communities Within the Digital Humanities Simon Mahony simon.mahony@kcl.ac.uk King’s College London Introduction Recent research at the Centre for Computing in the Humanities at King’s College London has focussed on the role and place of the digital humanities in the academic curriculum of Higher Education (see Jessop:2005, Jessop:forthcoming). This work is based on the experience of both our undergraduate and postgraduate programmes focusing particularly on the way in which students are encouraged to integrate the content of a variety of digital humanities courses and apply it to their own research project. In the case of the undergraduates this is developed in conjunction with their home department. These courses are designed to train not just the new generation of young scholars in our discipline but also the majority who will gain employment in a variety of professions in industry and commerce. Our students come from a range of disciplines and backgrounds within the humanities and what is highlighted in each case is the necessity to ensure that their projects meet the scholarly criteria of their home disciplines and the interdisciplinary aspects of humanities computing.This emphasises the need for training the students in collaborative method and reflective practice; the need to build a community of learning which will lead to a community of practice. This paper discusses recent research and initiatives within distance learning, focussing on how these can be repurposed for campus-based courses, and is illustrated by the findings of their use in a digital humanities course. Context There have been a number of initiatives that are pertinent to this topic. The published report on the accomplishments of the Summit on Digital Tools for the Humanities convened in 2005 at the University of Virginia (http://www.iath.virginia. 
edu/dtsummit/ ) identified areas where innovative change was taking place that could lead to what they referred to as “a new stage in humanistic scholarship”. The style of collaboration enabled by digital learning community tools is identified as one such area. This has been further reinforced at the National Endowment of the Humanities hosted Summit Meeting of Digital Humanities Centers and Funders held in April 2007 at the University of Maryland. _____________________________________________________________________________ 149 Digital Humanities 2008 _____________________________________________________________________________ (https://apps.lis.uiuc.edu/wiki/display/DHC/Digital+Humanitie s+Centers+Summit ) On the summit wiki among the areas of research priorities and funder priorities John Unsworth lists: • Collaborative work • Teaching and learning • Collaboration among scholars (https://apps.lis.uiuc.edu/wiki/display/DHC/Areas+of+researc h+priorities%2C+funder+priorities ) Building communities of learning and developing strategies for collaborative working has been the subject of much study within the distance learning community (Anderson:2004, Brown:2006, Perry and Edwards:2005, Swan:2002, et al) and here it is argued that this also needs to be a consideration for campus-based academic programmes. The growing trend among undergraduate programmes of a movement away from set courses to the introduction of modular and credit systems means that students no longer follow a single programme of study. Their time is fragmented between their chosen course options and they often only come together with their peers for ‘core courses’. Thus in the last two decades study has become more of an individual rather than community-based activity. This trend needs to be compensated for by teaching collaborative skills, the very same skills that are at the heart of the majority of digital humanities research projects. The ‘Community of Inquiry’ model (developed by Garrison, Anderson and Archer:2000 and 2004) draws out the basic elements which overlap to form the educational experience of the distance learner: social, cognitive, and teaching presence. This model is used as a framework to analyse the effectiveness of asynchronous discussion methodologies (which encourage reflective practice), with particular regard for cognitive presence (where students construct meaning through communication with their peers, which is particularly important in the development of critical thinking), when used in campus-based courses. Building Networked Communities for Collaborative and Reflective Teaching and Learning The highly collaborative nature of modern research practice makes it clear that future humanities scholars need to be trained in the collaborative process and to understand the importance of critical reflection (Jessop:2005, Jessop:forthcoming). This emphasis on collaborative practice represents a shift in the academic culture of humanities away from the popular funding model of a single researcher towards one of team working where no single person has complete control or ownership. This is closer to models in operation in the sciences where progress is often based on team efforts and reports frequently have many authors; we may need to develop protocols that borrow some aspects of science research practice.The results of the author’s limited research into the effectiveness of the collaborative process in medical research practice are also to be included in this study. 
To develop an environment that fosters collaboration and reflection students should be actively encouraged to engage with each other both inside and outside of the classroom. With social software (MySpace, Facebook, LiveJournal, inter alia) students are already building networked communities, and the blog and wiki have provided educators with simple, readily available tools to build learning communities. The wiki can be deployed as an experiential and formative learning environment outside of the classroom with students able to create their own content, comment on each others, and share resources using tools like del.icio.us and MyIntute. The blog supports this with a less formal reflective space which belongs to the students themselves rather than their course. The asynchronous nature of these media gives students the opportunity to reflect on their classmates’ contribution in the processes of creating their own (Swan:2000) and so instil the practice of critical reflection. Further, with simple applications such as MyYahoo and iGoogle students can draw all their varied networks together along with course and departmental webpages thus giving a single interface or ‘Personal Learning Portal’ (PLP) through which to access and manage their online resources. In their PLP students create a web interface for their own digital environment that includes: • Content management where they integrate both personal and academic interests • A networking system for connection with others • Collaborative and individual workspace • Communications setup • A series of syndicated and distributed feeds This model is based on a presentation given by Terry Anderson at the Centre for Distance Education, University of London in March 2007: http://www.cde.london.ac.uk/support/news/generic3307. htm). In this Anderson discusses how the Personal Learning Environment (PLE), such as that used at Athabasca, is an improvement on the popular Virtual Learning Environment (VLE). That argument is developed here with stress upon the further advantages of the PLP introduced earlier. What is notable is that this model represents an approach rather than a specific application and is portable and not _____________________________________________________________________________ 150 Digital Humanities 2008 _____________________________________________________________________________ dependant on a single department or even institution. This ensures sustainability as it allows and encourages students to take the tools and skills from one area and apply them in others (arguably the basis of humanities computing, see McCarty and Short:2002). At the same time it puts the emphasis for the responsibility for managing their own learning and web resources on the students. In this approach learners are encouraged to interact and collaborate in a way that does not occur when static webpages are viewed with a traditional browser. The pages on a wiki and the student’s PLP are dynamic and mutable as they can be edited by the user through their web browser. Learners gain the ability to enrich the material and, unlike a print publication where those annotations are only for personal use, make these available for others. 
Such exchanges of ideas are central to the processes of building communities of learning and it is in this way that knowledge grows as we are able to push the boundaries of scholarship.The model that is developing here is one in which the student moves from being a reader of other peoples’ material to active engagement with that material; a transition from being a ‘reader’ to being an ‘interpreter’. Conclusion Education is an academic, individual, and a social experience that requires a sustainable community of learning. The tools and experiences developed in the distance learning field can be re-purposed for the ‘analogue’ students. The suggested model based on the student’s PLP is grounded in collaborative practice, uses asynchronous discussion to develop reflective practice and ensure cognitive presence; it is sustainable and portable. Putting this in the wider context, it is by building a community of learners that we will instil the cooperative, collaborative, and reflective skills needed for a community of humanities scholars; skills that are equally in demand outside of the academy. The tools have already been subject to limited trials in the digital humanities programmes at King’s College London but the current academic year will see a more extensive application of them across our teaching. The final version this paper will report on the results of the experiences of both teachers and learners of this model applied to a humanities computing course. The work contributes to the pedagogy of the digital humanities in the academic curricula both within the teaching of humanities computing and the development of tools for collaborative research and on-line learning communities. Bibliography Anderson T (2004) ‘Theory and Practice of Online Learning’, Athabasca University online books: http://www.cde.athabascau. ca/online_book/ (last accessed 01/11/07) Biggs J (1999) Teaching for quality and learning at university, Open University Press. Brown J S (2006) ‘New Learning Environments for the 21st Century’ http://www.johnseelybrown.com/newlearning.pdf (last accessed 30/10/07) Garrison D, Anderson T, and Archer W (2000) ‘Critical inquiry in a text-based environment: computer conferencing in higher education’. The Internet and Higher Education Volume 2, Issue 3 pp.87-105 Garrison D,Anderson T, and Archer W (2001) ‘Critical thinking, cognitive presence, and computer conferencing in distance education’. The American Journal of Distance Education Volume 15, Issue 1 pp.7-23. Garrison D, Anderson T, and Archer W (2004) ‘Critical thinking and computer conferencing: A model and tool for assessing cognitive presence’. http://communitiesofinquiry. com/documents/CogPresPaper_June30_.pdf (last accessed 26/10/07) Jessop M (2005) ‘Teaching, Learning and Research in Final year Humanities Computing Student Projects’, Literary and Linguistic Computing. Volume 20, Issue 3. Jessop M (Forthcoming) ‘Humanities Computing Research as a Medium for Teaching and Learning in the Digital Humanities’, Digital Humanities Quarterly. Mahony S (2007) ‘Using digital resources in building and sustaining learning communities’, Body, Space & Technology, Volume 07/02. 
McCarty W and Short H (2002) ‘A Roadmap for Humanities Computing’ http://www.allc.org/reports/map/ (last accessed 04/11/07) Moon J (2004) A Handbook of Reflective and Experiential Learning: Theory and Practice, RoutledgeFalmer Perry B and Edwards M (2005) ‘Exemplary Online Educators: Creating a Community of Inquiry’ http://www.odlaa.org/ events/2005conf/ref/ODLAA2005PerryEdwards.pdf (last accessed 04/11/07) Richardson C and Swan K (2003) ‘Examining social presence in online courses in relation to students’ perceived learning and satisfaction’, Journal of Asynchronous Learning Volume 7, Issue 1 pp. 68-88 Siemens G (2006) Knowing Knowledge, Lulu.com Anderson T (2007) ‘Are PLE’s ready for prime time?’ http:// terrya.edublogs.org/2006/01/09/ples-versus-lms-are-plesready-for-prime-time/ (last accessed 03/11/07) _____________________________________________________________________________ 151 Digital Humanities 2008 _____________________________________________________________________________ Swan K (2000) ‘Building Knowledge Building Communities: consistency, contact and communication in the virtual classroom’, Journal of Educational Computing Research, Volume 24, Issue 4 pp. 359-383 Swan K (2002) ‘Building Learning Communities in Online Courses: the importance of interaction’, Education, Communication & Information, Volume 2, Issue 1 pp. 23-49 Terras M (2006) ‘Disciplined: Using Educational Studies to Analyse ‘Humanities Computing’’, Literary and Linguistic Computing,Volume 21, pp. 229-246. The Middle English Grammar Corpus - a tool for studying the writing and speech systems of medieval English Martti Mäkinen martti.makinen@uis.no University of Stavanger, Norway The Middle English Grammar Project The Middle English Grammar Project (MEG), shared by the Universities of Glasgow and Stavanger, is working towards the description of Middle English orthography, morphology and phonology. MEG is among the first attempts to span the gap between Jordan’s Handbuch der mittelenglischen Grammatik: Lautlehre (1925) and now. Our aim is to combine the advances in Middle English dialectology in the latter half of the 20th century and the computing power currently available in the service of writing an up-to-date grammar of Middle English. Middle English dialects and dialectology The study of Middle English dialects took a giant leap forward by Angus McIntosh’s insight that Middle English texts represent distinct regional varieties in their spelling systems, and therefore the spelling variants of these texts could be studied in their own right and not merely as reflections of the then speech systems, i.e. dialects (McIntosh 1963). This multitude of regional spellings arose when English had been replaced by French and Latin in all important aspects for nearly two centuries after the Norman Conquest: the reintroduction of English into literary and utilitarian registers from the thirteenth century onwards was not governed by any nationwide standard and thus English was written according to each scribe’s perception of the ‘correct’ spelling. McIntosh’s vision led to a project that grew into A Linguistic Atlas of Late Mediaeval English (LALME; 1986). Aims of MEG The Middle English Grammar project builds on the work of the LALME team and aims at producing a description of Middle English orthography, phonology and morphology, from 1100 to 1500. Here we use the term grammar in a wide, philological sense. 
Eventually, the grammar is meant as an replacement to Richard Jordan’s Handbuch der mittelenglischen Grammatik: Lautlehre, and to provide a reference point for the students and scholars of Middle English in the form of a broad description of the Middle English usages accompanied by more specific county studies, and eventually also all the base material we accumulate for this task (Black, Horobin, and Smith 2002: 13). _____________________________________________________________________________ 152 Digital Humanities 2008 _____________________________________________________________________________ The Middle English Grammar: How? The first task of MEG is to compile a corpus of Middle English texts localized in LALME. The corpus is called The Middle English Grammar Corpus, or MEG-C (the first installment forthcoming 2007). Secondly, the corpus texts need appropriate lemmatization and annotation in order to be usable in the course of MEG. Linguistic data is collected by transcribing extracts from either the original manuscripts, or good-quality microfilms. The prioritised material are the texts that were localized in LALME, although later texts that were not analysed for LALME will be taken into account as well. LALME covers years 1350-1450 (-1500); the material for the studies in 1100-1350 will be drawn from A Linguistic Atlas for Early Middle English (LAEME) (Laing and Lass, forthcoming 2007). The manuscript texts are represented by 3,000-word extracts (or in toto, if shorter), which should be sufficiently for studies on orthography, phonology and morphology. The planned corpus will sample c. 1,000 texts, therefore the projected size of the corpus is 2.5-3 M words. The conventions of transcription have been derived from those of the LAEME and A Linguistic Atlas of Older Scots projects (LAOS), with certain modifications. The most important questions that have been addressed during the transcription process have been whether to emphasise fidelity to the original vs. wieldy transcripts, and should the transcripts offer an interpretative reading of the manuscript text rather than the scribe’s actual pen strokes. According to the principles chosen, the transcriptions attempt to capture the graphemic and broad graphetic details, but not necessarily each detail on the level of individual handwriting (Black, Horobin, and Smith 2002: 11). MEG-C: lemmatization, annotation, publication The second practical task is to lemmatize and to annotate the Corpus. Previous historical English corpora (Helsinki Corpus, Middle English Medical Texts) show the limitations the lack of lemmas set to the corpus user when tackling the variety of spellings attested to by Middle English texts. The lemmas in MEG-C will have an Oxford English Dictionary headword. There will also be another cue in the source language (the direct source languge before Middle English, usually Old English, French/Anglo-Norman or Latin). These two reference points on either side of Middle English will provide the user the means to search for occurrences of a lexical item even when the full range of spelling variation in Middle English is not known. As regards the annotation of words of a text, they are divided into bound morphemes and other spelling units (this system is partly derived from Venezky (1970)). Each word is divided into a word initial sequence containing Onset and Nucleus, and they are followed by a series of Consonantal and Vowel Spelling Units. 
Each spelling unit is also given the equivalents in the source language and in Present Day English, thus enabling the search for e.g. all the ME reflexes of OE [a:] or or the spelling variants in Middle English that correspond to Present Day English word initial spelling sh-. For the task of annotation and lemmatization the corpus is rendered into a relational database. The database plan has tables for different extralinguistic information, and the actual texts will be entered word by word, i.e. in the table for corpus texts, there will be one record for each word. The annotation plan we are intending to carry out should result in a corpus where one can search for any combination of extralinguistic factors and spelling units with reference points embedded in the actual Middle English texts and also in the source language and PDE spelling conventions. The first installment of MEG-C will be published in 2007, containing roughly 30 per cent of the texts in the planned corpus in ASCII format. It will be on the Internet, accessible for anyone to use and download. Our aim with publication is two-fold: firstly, we will welcome feedback of any kind, and especially from scholars who know the texts well; secondly, we want to encourage and to see other scholars use the corpus. References Black, Merja, Simon Horobin and Jeremy Smith, 2002. ‘Towards a new history of Middle English spelling.’ In P. J. Lucas and A.M. Lucas (eds), Middle English from Tongue to Text. Frankfurt am Main: Peter Lang, 9-20. Helsinki Corpus = The Helsinki Corpus of English Texts (1991). Department of English, University of Helsinki. Compiled by Matti Rissanen (Project leader), Merja Kytö (Project secretary); Leena Kahlas-Tarkka, Matti Kilpiö (Old English); Saara Nevanlinna, Irma Taavitsainen (Middle English); Terttu Nevalainen, Helena Raumolin-Brunberg (Early Modern English). Horobin, Simon and Jeremy Smith, 1999. ‘A Database of Middle English Spelling.’ Literary and Linguistic Computing 14: 359-73. Jordan, Richard, 1925. Handbuch der mittelenglischen Grammatik. Heidelberg: Winter’s Universitätsbuchhandlung. LAEME = Laing, Margaret, and Lass, Roger, forthcoming 2007. A Linguistic Atlas of Early Middle English. University of Edinburgh. Nov. 22nd, 2007. http://www.lel.ed.ac.uk/ihd/ laeme/laeme.html Laing, M. (ed.) 1989. Middle English Dialectology: essays on some principles and problems by Angus McIntosh, M.L. Samuels and Margaret Laing. Aberdeen: Aberdeen University Press. _____________________________________________________________________________ 153 Digital Humanities 2008 _____________________________________________________________________________ LALME = McIntosh, M.L., Samuels, M.L. and Benskin, M. (eds.) 1986. A Linguistic Atlas of Late Mediaeval English. 4 vols. Aberdeen: Aberdeen University Press. (with the assistance of M. Laing and K. Williamson). LAOS = Williamson, Keith, forthcoming 2007. A Linguistic Atlas of Older Scots. University of Edinburgh. Nov. 22nd, 2007. http://www.lel.ed.ac.uk/research/ihd/laos/laos.html McIntosh, A. 1963 [1989]. ‘A new approach to Middle English dialectology’. English Studies 44: 1-11. repr. Laing, M. (ed.) 1989: 22-31. Middle English Medical Texts = Taavitsainen, Irma, Pahta, Päivi and Mäkinen, Martti (compilers) 2005. Middle English Medical Texts. CD-ROM. Amsterdam: John Benjamins. Stenroos, Merja, 2004. ‘Regional dialects and spelling conventions in Late Middle English: searches for (th) in the LALME data.’ In M. Dossena and R. 
Lass (eds), Methods and data in English historical dialectology. Frankfurt am Main: Peter Lang: 257-85. Stenroos, Merja, forthcoming 2007. ‘Sampling and annotation in the Middle English Grammar Project.’ In Meurman-Solin, Anneli and Arja Nurmi (eds) Annotating Variation and Change (Studies in Variation, Contacts and Change in English 1). Research Unit for Variation, Change and Contacts in English, University of Helsinki. http://www.helsinki.fi/varieng/journal/ index.html Venezky, R., 1970. The Structure of English Orthography. The Hague/Paris: Mouton. Designing Usable Learning Games for the Humanities: Five Research Dimensions Rudy McDaniel rudy@mail.ucf.edu University of Central Florida, USA Stephen Fiore sfiore@ist.ucf.edu University of Central Florida, USA Natalie Underberg nunderbe@mail.ucf.edu University of Central Florida, USA Mary Tripp mtripp@mail.ucf.edu University of Central Florida, USA Karla Kitalong kitalong@mail.ucf.edu University of Central Florida, USA J. Michael Moshell jm.moshell@cs.ucf.edu University of Central Florida, USA A fair amount of research suggests that video games can be effective tools for learning complex subject matter in specific domains (Cordova & Lepper, 1996; Ricci, Salas, & CannonBowers, 1996; Randel et al., 1992; Fiore et al., 2007; Garris & Ahlers et al., 2002). Although there has been much work in situating “serious gaming” (Sawyer, 2002) as a legitimate vehicle for learning in all types of disciplines, there is no central source for research on crafting and designing what is considered to be a usable game for the humanities. In this paper, we discuss research issues related to the design of “usable” humanitiesbased video games and situate those research questions along five interrelated dimensions. We first provide our definition of usability as used in this context. A usable humanities game is a game which is both functionally capable in terms of an interface and its human interactions as well as appropriately designed to support the types of learning objectives attached to its scenarios or levels. This definition is formed from the convergence of traditional interface usability and learning effectiveness. We support our delineation of research questions using theoretical work from scholars writing about gaming as well as applied examples from our own research and game development prototypes. Drawing from three years of experience building a variety of games, and literature culled from our various fields, we discuss some of the unique research questions that designing for the humanities pose for scholars and game designers. We separate these questions across five dimensions, each with representative humanities learning questions: _____________________________________________________________________________ 154 Digital Humanities 2008 _____________________________________________________________________________ 1. Participation: how can we encourage diverse groups of students and researchers to interact together in virtual space? How can we design a space that is appealing and equitable to both genders and to a diverse range of demographic profiles? 2. Mechanics: how do you design algorithms that are applicable to the types of tasks commonly sought in humanities courses? For instance, how might one develop an algorithm to “score” morally relative gameplay decisions in an ethical dilemma? 3. 
Ontology: how is a player’s self-image challenged when shifting from a real to what Gee (2003) calls a projected identity, and how is this process changed through the use of varying perspectives such as first-person or third-person perspective? 4. Hermeneutics: how do we probe the hidden layers that exist between game worlds, source content, and human computer interfaces? What “voices” are encouraged in virtual worlds, and what voices are repressed within this rhetorical space? 5. Culture: how are the cultural facets of primary source content mediated through the process of digitization? To consider these dimensions, we craft a theoretical base formed from a wide range of researchers such as Gee (2003), Prensky (2001), Squire (2002), and Jenkins (2006a, 2006b). From Gee, we draw inspiration from his discussion of well-designed games and his exploration of the implicit learning occurring in several different genres of digital video games. Much of Gee’s work has involved applying insights from the cognitive sciences to traditional humanities domains such as literature in order to explore identity, problem solving skills, verbal and nonverbal learning, and the transfer of learned abilities from one task to another. Marc Prensky notes that musical learning games in the humanities have been used for hundreds of years -- Bach’s Well Tempered Clavier and The Art of the Fugue are his “learning games,” simple to complex musical exercises that build skill. Prensky’s work also informs our historical analysis as well as insights from the pioneers working in the field of serious gaming for military applications. Jenkins’ work in applying the interests of gaming fans as critical lenses provides insight for both formative guidelines and posttask measures of “success” in learning game environments. These gaming discourse communities often form wildly active and influential fan groups, and these groups cultivate their own forms of expression and understanding through complex jargon, virtual initiations, and ritualistic rules and procedures in virtual interaction. Gaming environments and virtual worlds have also been shown to offer rich sources of material for investigating notions of gender, race, ethnicity, and cultural identity (Berman & Bruckman, 2001; Squire, 2002). Building on the work of these scholars, others have extended these general notions of digital game based learning to account for specific curricula or learning objectives such as media project management for humanities computing (McDaniel et al., 2006). To build these humanities learning games, we have assembled an interdisciplinary team composed of faculty members from Digital Media, English, and Philosophy. Individuals from this group have worked in a variety of capacities, as script writers, artists, programmers, and producers. Undergraduate and graduate students, in both classroom and research lab roles, have worked and contributed to each of these games in varying capacities. Five different games have been produced through these collaborations: 1. Discover Babylon (Lucey-Roper, 2006), a game produced by the Federation of American Scientists, the Walters Art Museum in Baltimore, and UCLA’s Cuneiform Digital Library Initiative (CDLI). One of our team members developed the storyline for this game. 2. 
The Carol Mundy Underground Railroad game (GreenwoodEricksen et al., 2006) examines issues of African-American history and culture and leads a player on an adventure from a Southern plantation to safety in the North through the Underground Railroad’s system of safehouses. This was a “modded” game built atop Neverwinter Nights. 3. The Turkey Maiden (Underberg, forthcoming) is an educational computer game project based on a version of Cinderella collected by folklorist Ralph Steele Boggs in 1930s Ybor City, Florida. This variant of the Cinderella story, called “The Turkey Maiden” (from Kristin Congdon’s anthology Uncle Monday and Other Florida Tales, 2001) forms the narrative structure of the game, which has been further developed by integrating specific tasks that the heroine Rosa (“Cinderella”) must successfully complete to advance in the game that are based in lessons to be learned by the player about Florida history and culture. 4. Chaucer’s Medieval Virtual World Video Game is a virtual medieval world game based on Chaucer’s Canterbury Tales. This tale emphasizes the battle between men’s perceived authority and women’s struggles for power. This game uses Chaucer’s narrative gap as a springboard for a virtual medieval quest. In addition to experiencing historical scenarios, the knight will disguise his identity and experience the world from various gender and social classes, the Three Estates of Clergy, Nobility, and Peasantry as well as the three feminine estates of virgin, widow, and wife. 5. The Medium is a prototype ethics game designed using the Torque game engine. This three year project was funded by the University of Central Florida’s Office of Information Fluency and is in its earliest stages of design. The game pairs a time travel theme with a switchable first and third person perspective and also includes an environmentalist subtext. We will use brief examples culled from these games to _____________________________________________________________________________ 155 Digital Humanities 2008 _____________________________________________________________________________ support our theoretical assertions and discuss ways in which usable humanities games can act as a springboard for bringing together subject matter experts, technologists, and learners. In their 2006 report on Cyberinfrastructure for the Humanities and Social Sciences, the American Council of Learned Societies writes the following in regards to digital networked technologies for the humanities: “A cyberstructure for humanities and social science must encourage interactions between the expert and the amateur, the creative artist and the scholar, the teacher and the student. It is not just the collection of data—digital or otherwise—that matters: at least as important is the activity that goes on around it, contributes to it, and eventually integrates with it” (14). Our goal is to foster this type of humanistic, communicative environment using new technologies for virtual worlds, usability testing, and game-based learning environments.We hope that the scholarly community working the field of digital humanities can help us to explore and refine both theoretical models and applied technologies related to this goal. References American Council of Learned Societies. 18 July, 2006. “Our Cultural Commonwealth: The Report of the American Council of Learned Societies’ Commission on Cyberinfrastructure for the Humanities and Social Sciences.” 13 October 2007. 
<http://www.acls.org/cyberinfrastructure/ OurCulturalCommonwealth.pdf>. Berman, J. & Bruckman, A. (2001). The Turing Game: Exploring Identity in an Online Environment. Convergence, 7(3), 83-102. Congdon, K. (2001). Uncle Monday and other Florida Tales. Jackson, MS: University Press of Mississippi. Cordova, D. I., & Lepper, M. R. (1996). Intrinsic motivation and the process of learning: beneficial effects of hypercontextualization, personalization, and choice. Journal of Educational Psychology, 88(4), 715-730. Fiore, S. M., Metcalf, D., & McDaniel, R. (2007). Theoretical Foundations of Experiential Learning. In M. Silberman (Ed.), The Experiential Learning Handbook (pp. 33-58): John Wiley & Sons. Garris, R., R. Ahlers, et al. (2002). “Games, Motivation, and Learning: A Research and Practice Model.” Simulation Gaming 33(4), 441-467. Greenwood-Ericksen, A., Fiore, S., McDaniel, R., Scielzo, S., & Cannon-Bowers, J. (2006). “Synthetic Learning Environment Games: Prototyping a Humanities-Based Game for Teaching African American History.” Proceedings of the 50th Annual Meeting of the Human Factors and Ergonomics Society. Santa Monica, CA: Human Factors and Ergonomics Society. Jenkins, H. (2006). Convergence Culture. New York: New York University Press. Jenkins, H. (2006). Fans, Bloggers, and Gamers. New York: New York University Press. Lucey-Roper M. (2006), Discover Babylon: Creating A Vivid User Experience By Exploiting Features Of Video Games And Uniting Museum And Library Collections, in J. Trant and D. Bearman (eds.). Museums and the Web 2006: Proceedings, Toronto: Archives & Museum Informatics, published March 1, 2006 at <http://www.archimuse.com/mw2006/papers/luceyroper/lucey-roper.html>. McDaniel, R., Fiore, S. M., Greenwood-Erickson, A., Scielzo, S., & Cannon-Bowers, J. A. (2006).Video Games as Learning Tools for Project Management. The Journal of the International Digital Media and Arts Association, 3(1), 78-91. Prensky, M. (2001). Digital Game-Based Learning. New York: McGraw-Hill. Randel, J. M., Morris, B. A., Wetzel, C. D., & Whitehill, B.V. (1992). The effectiveness of games for educational purposes: a review of recent research. Simulation & Gaming, 23(3), 261276. Ricci, K. Salas, E., & Cannon-Bowers, J. (1996). “Do ComputerBased Games Facilitate Knowledge Acquisition and Retention?” Military Psychology 8(4), 295-307. Sawyer, B. (2002). Serious Games: Improving Public Policy through Game-based Learning and Simulation. Retrieved June 24 2006, from http://www.seriousgames.org/images/seriousarticle.pdf. Squire, K. (2002). Cultural Framing of Computer/Video Games. Game Studies, 2(1). Underberg, Natalie (forthcoming). “The Turkey Maiden Educational Computer Game.” In Folklife in Education Handbook, Revised Edition, Marshall McDowell, ed. Gee, J. P. (2003). What Video Games Have to Teach Us About Learning and Literacy. New York: Palgrave Macmillan. _____________________________________________________________________________ 156 Digital Humanities 2008 _____________________________________________________________________________ Picasso’s Poetry:The Case of a Bilingual Concordance Luis Meneses ldmm@cs.tamu.edu Texas A&M University, USA Carlos Monroy cmonroy@cs.tamu.edu Texas A&M University, USA Richard Furuta furuta@cs.tamu.edu Texas A&M University, USA Enrique Mallen mallen@shsu.edu Sam Houston State University, USA Introduction Studying Picasso as writer might seem strange, considering that the Spanish artist is mostly known for his paintings. 
However, in the Fall of 2006 we began working on Picasso’s writings. Audenaert, et.al. [1], describe the characteristics of Picasso’s manuscripts, and the challenges they pose due to their pictorial and visual aspects. With over 13,000 artworks up-to-date catalogued, the On-line Picasso Project [5] includes a historical narrative of relevant events in the artist’s life. In this paper we discuss the contents of the texts—from the linguistic standpoint—and the implementation of a bilingual concordance of terms based on a red-black tree. Although concordances have been widely studied and implemented within linguistics and humanities, we believe that our collection raises interesting challenges; first because of the bilingual nature of Picasso’s poems, since he wrote both in Spanish and French, and second, because of the connection between his texts and his paintings. The work reported in this paper focuses on the first issue. Integrating Texts Into the On-line Picasso Project In the Fall of 2007 Picasso’s texts were added to the collection along with their corresponding images, and a concordance of terms both in Spanish and French was created.The architecture created for Picasso’s poems is partially based on the one we developed for the poetry of John Donne [7]. As often happens in the humanities, each collection has its own characteristics, which makes a particular architecture hard if not impossible to reuse directly. For example, Donne’s texts are written in English; Picasso in contrast wrote both in Spanish and French. The Concordance of Terms A concordance, according to the Oxford English Dictionary, is “an alphabetical arrangement of the principal words contained in a book, with citations of the passages in which they occur.” When applied to a specific author’s complete works, concordances become useful tools since they allow users to locate particular occurrences of one word, or even more interestingly, the frequency of such words in the entire oeuvre of an author. Apparently, the first concordances in English ever put together were done in the thirteenth century, and dealt with the words, phrases, and texts in the Bible. Such concordances were intended for the specialized scholar of biblical texts and were never a popular form of literature. As might be expected, these were soon followed by a Shakespeare concordance. A concordance of the literary works of Pablo Picasso has more in common with a Biblical concordance than with a Shakespearian concordance, due to the manner in which the Spanish artist/poet composed his poems. Many critics have pointed out the heightened quality of words in Picasso’s texts, a value that surpasses their own sentential context. One gets the impression that words are simply selected for their individual attributes and are then thrown together in the poems. Picasso himself appears to admit using this technique when he is quoted as saying that “words will eventually find a way to get along with each other.” For this reason, readers of Picasso’s poems become well aware of the frequent recurrence of certain “essential words,” which one is then eager to locate precisely in each of the poems to determine significant nuances. A catalogue raisonné is a scholarly, systematic list of an artist’s known works, or works previously catalogued. The organization of the catalogues may vary—by time period, by medium, or by style—and it is useful to consult any prefatory matter to get an idea of the overall structure imposed by the cataloguer. 
Printed catalogues are by necessity constrained to the time in which they are published.Thus, quite often catalogue raisonnés are superseded by new volumes or entirely new editions, which may (or may not) correct an earlier publication [2]. Pablo Picasso’s artistic creations have been documented extensively in numerous catalogs. Chipp and Wofsy [3], started publishing a catalogue raisonné of Picasso’s works that contains an illustrated synthesis of all catalogues to date on the works of Pablo Picasso. _____________________________________________________________________________ 157 Digital Humanities 2008 _____________________________________________________________________________ Concordances are often automatically generated from texts, and therefore fail to group words by classes (lexical category, semantic content, synonymy, metonymy, etc.) The Concordance we are developing will allow users to group words in such categories, thus concentrating on the network of interrelations between words that go far beyond the specific occurrences in the poems. Figure 1. A tabbed interface presents users with French and Spanish Texts in chronological order. On the right are text and image presentations By narrowing down the number of these “essential words,” the concordance also allows users to delimit the “thematic domain” elaborated in Picasso’s writings. Clearly many words deal with physical duress and the confrontation between good and evil, as manifestations of concrete human suffering during the Spanish Civil War and the German occupation of France in World War II. By identifying the range of words employed, users can clearly determine the political and cultural environment that surrounds Picasso’s artistic creations during this period. Nevertheless, one must not forget that Picasso’s main contribution to the world is that of a plastic artist. A Concordance will allow users to identify each of the words Picasso used and link them to specific graphic images in his artworks. It has been argued that Picasso’s poetry is quite “physical” (he often refers to objects, their different colors, textures, etc.). Even in his compositional technique, one gets a sense that the way he introduces “physical” words into his poems emulates the manner in which he inserted “found objects” in his cubist collages. Many critics have pointed out, on the other hand, that Picasso’s art, particularly during his Cubist period, is “linguistic” in nature, exploring the language of art, the arbitrariness of pictorial expression, etc. Mallen [6] argues that Cubism explored a certain intuition Picasso had about the creative nature of visual perception. Picasso realized that vision involves arbitrary representation, but, even more importantly, that painting also does. Once understood as an accepted arbitrary code, painting stopped being treated as a transparent medium to become an object in its own right. From then on, Picasso focused his attention on artistic creation as the creative manipulation of arbitrary representations.Viewed in this light, painting is merely another language, governed by similar strict universal principles as we find in verbal language, and equally open to infinite possibilities of expression. A concordance allows users to see these two interrelated aspects of Picasso’s career fleshed out in itemized form. Picasso is a bilingual poet. This raises several interesting questions connected to what has been pointed out above. 
One may wonder, for instance, if Picasso's thematic domain is "language-bound," in other words, whether he communicates certain concepts in one language but not in the other. A Concordance will allow users to set correspondences between words in one language and another. Given Picasso's strong Spanish heritage, it would be expected that concrete ideas (dealing with food, customs, etc.) will tend to be expressed exclusively in Spanish, while those ideas dealing with general philosophical and religious problems will oscillate between the two languages. The possible corroboration of this point is another objective of the planned Concordance. Figure 2. Concordance with term frequency. On the right, term-in-context display. The Red Black Tree Implementation Term concordances require the extraction of terms from a large corpus along with the metadata related to their occurrences, an operation that is often computationally expensive. The Digital Donne, for instance, is a pre-computed concordance. On the other hand, repositories of texts are under constant revision as errors are detected and corrected; when the corpus is modified, part or all of the concordance has to be rebuilt. To solve this problem, the Picasso concordance is computed on the fly, without requiring any previous processing. The repository has initially been divided into poems, stanzas, and lines, then stored in a database. Using standard join operations, the poems are reconstructed, allowing the terms to be retrieved along with additional metadata such as title, section, and line number. Once the poems have been reconstructed, each poem line is broken down into terms, which are defined as a series of characters delimited by a textual space. _____________________________________________________________________________ 158 Digital Humanities 2008 _____________________________________________________________________________ Each occurrence of each term, in turn, is inserted into the data structure, along with its occurrence metadata. Our algorithm consists of an augmented data structure composed of a Red Black Tree [4,8], where each node represents one term found in Picasso's writings and is used as a pointer to a linked list. A Red Black Tree is a self-balancing binary tree that can achieve insert, delete, and search operations in O(log n) time. Because the linked list only needs to support insertion, and random access is not required, term occurrences can be traversed sequentially in O(n) time. Picasso's literary legacy can be browsed and explored using titles as surrogates, ordered by date, and by the term concordance. Each entry provides the number of occurrences of a term and its corresponding list, which can be used as an index to browse the poems. The retrieval process for terms and their occurrences is carried out by selecting specific terms or choosing from an alphabetical index of letters. To achieve this, a subtree is extracted from the data structure and traversed, obtaining every occurrence of a term along with additional metadata including a unique poem identifier, page and line number. Extensible Stylesheet Language Transformations (XSLTs) are used to transform the resulting XML output, extract the specific term occurrences within a line of the poem, and produce a surrogate composed of a portion of the text. Additionally, this surrogate gives access to the corresponding poems through hyperlinks.
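The general idea of an on-the-fly concordance keyed by term can be sketched in a few lines. The sketch below is not the project's implementation: it substitutes an ordinary dictionary with sorted traversal for the red-black tree (the per-operation costs differ, though the ordered "subtree" walk it supports is the same), and all identifiers and sample lines are invented for illustration.

```python
from collections import defaultdict

class Concordance:
    """Toy on-the-fly concordance: each term maps to a list of occurrences,
    standing in for red-black tree nodes that each point to a linked list."""

    def __init__(self):
        self.occurrences = defaultdict(list)  # term -> [(poem_id, page, line), ...]

    def add_line(self, poem_id, page, line_no, text):
        # terms are simply character sequences delimited by spaces
        for term in text.split():
            self.occurrences[term.lower()].append((poem_id, page, line_no))

    def lookup(self, term):
        # all occurrences of a single term, in document order
        return self.occurrences.get(term.lower(), [])

    def by_initial(self, letter):
        # alphabetical traversal of every term starting with a letter,
        # the analogue of extracting and walking a subtree
        return {t: occs for t, occs in sorted(self.occurrences.items())
                if t.startswith(letter.lower())}

# Hypothetical usage with invented identifiers and text
conc = Concordance()
conc.add_line("poem-001", 158, 3, "la paloma y el toro")
conc.add_line("poem-002", 161, 1, "el toro negro")
print(conc.lookup("toro"))          # [('poem-001', 158, 3), ('poem-002', 161, 1)]
print(list(conc.by_initial("t")))   # ['toro']
```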
A new component of our implementation is a Spanish-French thesaurus that correlates terms in both languages, along with their meanings and commentary. Because our concordance is created on the fly, we have to devise a mechanism to support this. Additionally, this approach still remains to be tested with different corpuses in other languages, especially where terms are separated uniquely and spaces between them play a different role in language constructs. The term extraction algorithm is efficient using spaces as delimiters—a common case both in Spanish and French. However, other languages might include composite words. References 1. Audenaert, N., Karadkar, U., Mallen, E., Furuta, R., and Tonner, S. “Viewing Texts: An Art-Centered Representation of Picasso’s Writings,” Digital Humanities (2007): 14-16. 2. Baer, B. Picasso Peintre-Graveur. “Catalogue Raisonné de L’oeuvre Gravé et des Monotypes.” 4 Vols. Berne, Editions Kornfeld, 1986-1989. 3. Chipp, H. and Wofsy, A. The Picasso Project. San Francisco: Alan Wofsy Fine Arts. 1995-2003. 4. Cormen, T., Leiserson, C., Rivest, R., and Stein, C., Introduction to Algorithms, The MIT Press, 2nd Edition, 2001. 5. Mallen, Enrique. “The On-line Picasso Project.” http:// picasso.tamu.edu accessed October 2007. 6. Mallen, Enrique. The Visual Grammar of Pablo Picasso. Berkeley Insights in Linguistics & Semiotics Series. New York: Peter Lang. 2003. 7. Monroy, C., Furuta, R., and Stringer, G., “Digital Donne: Workflow, Editing Tools, and the Reader’s Interface of a Collection of 17th-century English Poetry.” Proceedings of JCDL 2007,Vancouver, B.C. 2007. 8. Red-Black Tree. Accessed 2007-11-15, http://en.wikipedia. org/wiki/Red_black_tree Acknowledgements This material is based upon work supported by the National Science Foundation under Grant No. IIS-0534314. _____________________________________________________________________________ 159 Digital Humanities 2008 _____________________________________________________________________________ Exploring the Biography and Artworks of Picasso with Interactive Calendars and Timelines Luis Meneses ldmm@cs.tamu.edu Texas A&M University, USA Richard Furuta furuta@cs.tamu.edu Texas A&M University, USA Enrique Mallen mallen@shsu.edu Sam Houston State University, USA Introduction Scholars and general users of digital editions face a difficult and problematic scenario when browsing and searching for resources that are related to time periods or events. Scrolling continuously through a long list of itemized search results does not constitute an unusual practice for users when dealing with this type of situation. The problem with this searching mechanism is that a notion of the corresponding dates or keywords associated with the event are required and constitute a precondition to a successful search. An ordered list is unable to provide the required affordances and constraints that users need and desire to conduct scholarly research properly. It is a common practice among users to utilize the search mechanism present in most web browsers, and then perform another search among the obtained results to “narrow down” or limit the results to a smaller working set that is easier to manage. The use of an external search mechanism in a digital edition is a strong indicator that improved interfaces must be designed, conceived and implemented, just to achieve the sole purpose of facilitating scholarly research. 
Background The Online Picasso Project (OPP) is a digital collection and repository maintained by the Center for the Study of Digital Libraries at Texas A&M University and the Foreign Languages Department at Sam Houston State University. As of November 2007, it contains 13,704 catalogued artworks, 9,440 detailed biographical entries, a list of references about Picasso's life and works, and a collection of articles from various sources regarding the renowned Spanish artist. How does the OPP provide its content? Java Servlets are used to retrieve the documents and metadata from a MySQL database. An Apache Tomcat web server then outputs XML which is linked to XSLTs and CSS; the latter are used to perform a dynamic transformation into standards-compliant HTML, achieving a clear separation between content and presentation. The OPP includes an interface that allows scholars and users in general to browse through the significant events in Picasso's life, his artworks, and a list of museums and collections that hold ownership of the various art objects created by the artist during his lifetime. The implemented navigation scheme works well for experienced scholars who have a deep knowledge of Picasso's life and works. However, the amount of available information can be overwhelming to the project audience, composed primarily of art scholars and historians, because of its magnitude and painstaking detail. The Humanities rely profoundly on dates to create a strong relationship between events and documents. It is reasonable to assume that key events influenced Picasso in ways that caused significant changes in his artistic style and expression. The OPP contains a vast amount of information that could be used in conjunction with the proposed interfaces in order to help answer this type of inquiry. The calendars and timelines in the OPP propose an alternate method for exploring an existing document collection, since they use date-related metadata as a discriminating factor instead of an ordering criterion. Dates are used to provide mechanisms and interfaces that allow rich exploration of the artist's legacy in order to get a more complete and concise understanding of his life. Calendars The calendar interfaces were developed to provide a timetable for the creation of artworks and the occurrence of events catalogued in the OPP. Their purpose is to provide, at a quick glance, the relevant biographical and artistic dates. Additionally, the calendars provide means for formulating direct comparisons between dates, months and seasons within a single year. The calendar interfaces have five display possibilities to filter results, which apply to the artworks and to the narrative: 1. show start date and end date (used to display "exact matches"); 2. show start date or end date; 3. show start date only; 4. show end date only; 5. show ranges of dates. Surrogates are provided in the form of artwork thumbnails and textual descriptions of the events. Clicking on the specified date, month or season produces a web page where detailed descriptions can be accessed.
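A minimal sketch of the five display filters listed above is given below, under the assumption that each catalogued artwork or event carries optional start and end dates. The mode names, the item representation and the reading of "exact matches" are illustrative guesses, not the OPP's actual code.

```python
from datetime import date

def calendar_filter(items, target, mode):
    """Filter artworks or events against a target date.
    Items are assumed to be dicts with optional 'start' and 'end' dates."""
    def keep(item):
        start, end = item.get("start"), item.get("end")
        if mode == "start_and_end":   # 1. both dates match ("exact matches")
            return start == target and end == target
        if mode == "start_or_end":    # 2. either date matches
            return target in (start, end)
        if mode == "start_only":      # 3. match on the start date only
            return start == target
        if mode == "end_only":        # 4. match on the end date only
            return end == target
        if mode == "range":           # 5. target falls within the range of dates
            return start is not None and end is not None and start <= target <= end
        raise ValueError(f"unknown mode: {mode}")
    return [item for item in items if keep(item)]

# Hypothetical records
artworks = [
    {"title": "Study A", "start": date(1937, 5, 1), "end": date(1937, 6, 4)},
    {"title": "Study B", "start": date(1937, 6, 4), "end": date(1937, 6, 4)},
]
print(calendar_filter(artworks, date(1937, 6, 4), "range"))       # both studies
print(calendar_filter(artworks, date(1937, 6, 4), "start_only"))  # Study B only
```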
_____________________________________________________________________________ 160 Digital Humanities 2008 _____________________________________________________________________________ Colors were added to the dates containing items, to show the distribution of the artworks and events. The design decision to include additional stratification schemes relates to the research goal of providing an enhanced browsing mechanism. The inclusion of this feature does not entail any additional processing of the data, but it provides a richer environment for browsing the document collection. Figure 2: Event surrogates – start and end dates Additionally, this interface allows the users of the project to quickly determine periods where the artist was more prolific. Dates where Picasso was "active" are clearly visible and identifiable. This data stratification gives users an additional layer of information. Figure 1: Artwork surrogates – ranges of dates The use of calendar interfaces provides new possibilities for scholars and users in general: the discovery of relationships between documents, which standard HTML interfaces do not facilitate. The main advantages derived from their use include: 1. The possibility of visualizing an entire year in Picasso's biography and artistic career. Through the use of a calendar-based interface, artworks can be visually matched to their specific dates of creation. This provides a visualization mechanism that allows the user to navigate through a potentially large number of artworks in one screen. The number of artworks that can be accessed depends on how esthetically prolific the artist was in that specific year. For biographical events, a similar scenario is created. Users can navigate through an entire year of events, and the information is presented in a way that affords quick navigation and encourages interaction. Visually, periods where more events occurred in Picasso's life are easily identifiable. 2. The possibility of moving to a specific day, month or season within a year in one single interaction with the interface. Through the use of information surrogates, users can move to a specific day, month or season within a year with a single click. The actions produced by scrolling through multiple screens are eliminated, and users can view the artifacts produced on a specific date with ease. Consequently, comparisons between artworks can be achieved fluidly due to the enhancements in the browsing environment. Similarly, users can read about specific events in Picasso's biography by visually selecting concrete dates. The deployment of the calendar interfaces produces visualizations that enable scholars to determine time periods or specific dates in Picasso's life. They serve as a tool that helps identify when the artist was more prolific, as well as the opposite: when fewer artworks were produced. This analysis could also be accomplished by browsing through the artwork collection, but it would require additional interaction from the user, which in the end means more time and effort. The interfaces also constitute a new way of browsing the document collection: _____________________________________________________________________________ 161 Digital Humanities 2008 _____________________________________________________________________________ visually and in strict correlation to time. They also facilitate further exploration of artworks and events on certain days, months, seasons and years. Timelines This new browsing mechanism in the OPP, based on the Simile Timeline, was introduced by placing artworks as markers in a time frame. It was designed to allow users to examine the artworks produced along with the recorded events of a given year. Further modifications were necessary, since the Simile Timeline was designed to support single events occurring on one day. Initially, the timeline was designed to focus only on Picasso's artworks.
This design choice gave users great freedom to explore large amounts of information in a manipulable visual space. However, the biographical events were being excluded. These events, included in the OPP, are particularly important since they provide a historical framework which is crucial to the understanding of the artist's legacy and is tightly bound to his work rhythm. Moreover, some of the artworks produced by Picasso have a certain degree of uncertainty in their dates of creation, since their start and end dates were not documented. The timelines provide a mechanism for dealing with uncertainty: such artworks are represented by a time bar with a lower level of saturation in its color. This gives a visual clue that the start and end dates are not fixed, and are subject to speculation. Additional information, such as changes in style and location extracted from the artist's biography, was injected into the timeline. Its purpose is to provide an additional layer of information that can be used to interpret the events that led to the creation and mood of certain artifacts, thus enhancing the browsing environment. The advantages gained through the use of timelines include: 1. The possibility of grasping visually the time-extensions in Picasso's output. Picasso at times worked on several artworks which share a similar theme. Even though they share common characteristics, they are not identical; each of these artworks has variations which differentiate them. The timelines allow users to freely explore all the artworks and events within a given year and point out their similarities and differences, which affords further examination of the evolution of a common, shared theme. 2. The possibility of visually comparing works in chronological order. The timelines provide a mechanism that filters artworks according to their year of creation. The enhanced navigational scheme allows scholars to view artifacts in chronological order. The addition of surrogates allows users to point out a specific item, and then compare it with others through their correlation in time. 3. The possibility of seeing correlations between changes of location and artwork creation. The deployed timelines allow the exploration of correlations between location changes and the creation of specific artworks. Changes in location are marked in the timelines, and clearly denote a point in time where exposure to a new or recurring context occurred. Figure 4: Changes in location Figure 3: Exploring artwork - event correlation 4. The possibility of comparing different stylistic periods as they relate to concrete artworks and specific locations. _____________________________________________________________________________ 162 Digital Humanities 2008 _____________________________________________________________________________ The timelines produce a visualization that puts changes in thematic periods and in locations in a common context, along with the artworks that were elaborated in that time span. This tool is augmented with the navigational ease of clicking through a series of artworks to compare their characteristics and perform a deeper analysis if necessary. The interfaces have been deployed taking into account that additional functionality could be introduced with ease.
As a consequence, information regarding Picasso's writings and poems will be included in the next iteration of the timelines and calendars. This will allow a deeper understanding of his legacy, since it could potentially provide a greater understanding of his artworks and biography. The writings and poems constitute a compendium of his thoughts and insights, extremely valuable because they were written by the artist himself. The timeline interfaces in the OPP narrow the gap and visually correlate biographical entries with artworks. They provide scholars with a bigger picture of Picasso's artistic landscape and of how events could have affected his artworks. The dynamic nature of the web-accessible interfaces facilitates the insertion of new documents and metadata, thus altering the graphical space, which is not feasible in static, printed editions. References The Online Picasso Project. Available at http://picasso.tamu.edu. [Accessed May 17, 2007] SIMILE Project. Available at http://simile.mit.edu/. [Accessed on May 24, 2007] Monroy, R. Furuta, and E. Mallen, "Visualizing and Exploring Picasso's World," in Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, 2003, IEEE Computer Society, pp. 173-175. Kumar, R. Furuta, and R. Allen, "Metadata visualization for digital libraries: interactive timeline editing and review," in Proceedings of the Third ACM Conference on Digital Libraries, 1998, ACM Press, pp. 126-133. Monroy, R. Furuta, E. Urbina, and E. Mallen, "Texts, Images, Knowledge: Visualizing Cervantes and Picasso," in Proceedings of the Visual Knowledges Conference, 2003, John Frow, www.iash.ed.ac.uk/vkpublication/monroy.pdf. N. Audenaert, U. Karadkar, E. Mallen, R. Furuta, and S. Tonner, "Viewing Texts: An Art-centered Representation of Picasso's Writings," in Proceedings of Digital Humanities 2007, 2007. C. Monroy, R. Furuta, and E. Mallen, "Creating, Visualizing, Analyzing, and Comparing Series of Artworks," 2003, Unpublished paper. Computer Assisted Conceptual Analysis of Text: the Concept of Mind in the Collected Papers of C.S. Peirce Jean-Guy Meunier meunier.jean-guy@uqam.ca Université du Québec à Montréal, Canada Dominic Forest dominic.forest@umontreal.ca Université de Montréal, Canada Computer assisted reading and analysis of text (CARAT) has recently explored many variants of what it has become fashionable to call "text mining" strategies. Text mining strategies are theoretically robust on large corpora. However, since they mainly operate at a macro textual level, their use still meets with resistance from expert readers who aim at fine and minute conceptual analysis. In this paper, we present a computer assisted strategy for conceptual analysis based on automatic classification and annotation strategies. We also report on an experiment using this strategy on a small philosophical corpus. Conceptual analysis is an expert interpretation methodology for the systematic exploration of the semantic and inferential properties of a set of predicates expressing a particular concept in a text or in a discourse (Desclés, 1997; Fodor, 1998; Brandom, 1994; Gardenfors, 2000; Rastier, 2005). Computer assisted reading and analysis of text (CARAT) is the computer assistance of this conceptual analysis.
The strategy of CACAT Our text analysis strategy rests on the following main hypothesis: the expression of a canonical concept in a text presents linguistic regularities, some of which can be identified using classification algorithms. This hypothesis itself unfolds into three sub-hypotheses. Hypothesis 1: conceptual analysis can be realized by the contextual exploration of the canonical forms of a concept. This is realized through the classical concordance strategy and its variants, applied to a pivotal term and its linguistic variants (e.g. mind, mental, mentally, etc.) (Pincemin et al., 2006; McCarthy, 2004; Rockwell, 2003). _____________________________________________________________________________ 163 Digital Humanities 2008 _____________________________________________________________________________ Hypothesis 2: the exploration of the contexts of a concept is itself realized through some mathematical classification strategy. This second hypothesis postulates that the contexts of a concept present regularities that can be identified by mathematical clustering techniques resting upon similarities found among contextual segments (Jain et al. 1999; Manning and Schütze, 1999). Hypothesis 3: classes of similar conceptual contexts can be annotated so as to categorize their semantic content. This last hypothesis allows one to associate with each segment of a class of contexts some formal description of its content, be it semantic, logical, pragmatic, rhetorical, etc. (Rastier et al., 2005; Djioua and Desclés, 2007; Meyers, 2005; Palmer et al., 2005; Teich et al., 2006). Some of these annotations can be realized through algorithms; others can only be done manually. Experiment From these three hypotheses emerges an experiment which unfolds in five phases. This experiment was carried out on C.S. Peirce's Collected Papers (volumes I-VIII) (Peirce, 1931, 1935, 1958). More specifically, this research aimed at assisting conceptual analysis of the concept of "Mind" in Peirce's writings. Phase 1: Text preparation In this methodology, the first phase is text pre-processing. The aim of this phase is to transform the initial corpus according to the requirements of phases 2 and 3. Various operations of selection, cleaning, tokenisation, and segmentation are applied. In the experiment we report on, no lemmatisation or stemming was used. The corpus so prepared was composed of 74 450 words (tokens) with a lexicon of 2 831 word types. Phase 2: Key Word In Context (KWIC) extraction (concordance) Using the corpus pre-processed in phase 1, a concordance is made with the pivotal word "Mind". The KWIC algorithm generated 1 798 contextual segments of an average of 7 lines each. In order to be able to manually evaluate the results of the computer-assisted conceptual analysis, we decided to work with only a random sample of the 1 798 contextual segments. The sampling algorithm delivered 717 contextual segments; this sample is composed of 3 071 words (tokens) and 1 527 word types. Phase 3: KWIC clustering The concordance is in itself a subtext (of the initial corpus). A clustering technique was applied to the concordance results: in this project, a hierarchical agglomerative clustering algorithm was applied (a minimal sketch of this kind of clustering step is given below).
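The authors do not specify the software used for this step. Purely as an illustration of hierarchical agglomerative clustering of contextual segments, a sketch along the following lines could be used; scikit-learn is an assumption, and the segment texts are abbreviated placeholders rather than the actual concordance sample.

```python
# Illustrative only: cluster KWIC segments for the pivotal word "mind"
# using a bag-of-words representation and agglomerative clustering.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AgglomerativeClustering

segments = [
    "laws of mind divide themselves into laws of the universal action of mind",
    "the one original law to be the recognized law of mind the law of association",
    "supposing mind to be governed by blind mechanical law",
    # ... the remaining contextual segments from the concordance sample
]

# Vectorize each contextual segment (a dense array is required by the clusterer)
vectors = TfidfVectorizer().fit_transform(segments).toarray()

# Bottom-up (agglomerative) clustering into a chosen number of clusters
labels = AgglomerativeClustering(n_clusters=2).fit_predict(vectors)

for label, text in zip(labels, segments):
    print(label, text[:50])
```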
It is often on this type of representation that many numerical analyses start their interpretation. One traditional critic presented by expert analysts is their great generality and ambiguity. This kind of analysis and representation give hints on the content of documents, but as such it is difficult to use for fine grained conceptual analysis. It must hence be refined. It is here that the annotation phase comes into play. Phase 4: Annotation The annotation phase allows the expert reader to make more explicit the type of information contained in each clusters (generated in phase 3). For instance, the interpreter may indicate if each cluster is a THEME, a DEFINITION, a DESCRIPTION, an EXPLANATION, an ILLUSTRATION, an INFERENCE, or what is it MODALITY (epistemic, epistemological, etc.). The variety of annotation types is in itself a research object and depends on various textual and linguistic theories. Annotation results In this abstract, size constraints do not allow us here to present detailed results of classification and annotation processes. We shall only present a sample on a few segments of three classes. Annotations of cluster 1: The first cluster contained 17 segments all of which have received an annotation. Here are samples of annotation for two segments of cluster 1. The annotation is preceded by the citation itself from the original text. [SEGMENT NO 512] “Finally laws of mind divide themselves into laws of the universal action of mind and laws of kinds of psychical manifestation.” ANNOTATION: DEFINITION: the law of mind is a general action of the mind and a psychological manifestation _____________________________________________________________________________ 164 Digital Humanities 2008 _____________________________________________________________________________ [SEGMENT NO 1457] “But it differs essentially from materialism, in that, instead of supposing mind to be governed by blind mechanical law, it supposes the one original law to be the recognized law of mind, the law of association, of which the laws of matter are regarded as mere special results.” ANNOTATION: EXPLICATION: The law of mind is not a mechanical materialism. Phase 5: Interpretation The last phase is the interpretative reading of the annotations. Here, the interpreter situates the annotated segments into his own interpretative world. He may regroup the various types of annotation (DEFINITIONS, EXPLANTIONS, etc.) and hence build a specific personal data structure on what he has annotated. From then on, he may rephrase these in his own language and style but most of all situate them in some theoretical, historical, analytical, hermeneutic, epistemological, etc. perspective. It is the moment where the interpreter generates his synthesis of the structure he believes underlies the concept. We present here a sample of the synthesis of conceptual analysis assisted by the CARAT process on cluster 1 (the concept of “mind” in C.S. Peirce’s writings – cluster 1). The law of Mind: association The Peircian theory of MIND postulates that a mind is governed by laws. One of these laws, a fundamental one, is associative (segment 512). This law describes a habitus acquired by the mind when it functions (segment 436). Association is connectivity This functioning is one of relation building through connections. The connectivity is of a specific nature. It realizes a synthesis (à la Kant) which is a form of “intellectual” generalisation (segment 507). It is physically realized Such a law is also found in the biological world. 
It is a law that can be understood as accommodation (segment 1436). In fact, this law is the specific form of the Mind’s dynamic. It is a fundamental law. But it is not easy for us to observe it because we are victim of a interpretative tradition (segment 1330) that understands the laws of mind as laws of nature.This is a typical characteristic of an “objective idealism” (segments 1762 and 1382).The laws of mind do not belong to mechanist materialism (segments 90 and 1382). And there exist a variety of categories There exist subdivisions of this law. They are related to the generalisation process that is realised in infanthood, education, and experience. They are intimately related to the growth of consciousness (segments 375 and 325). Conclusion This research project explores a Computer-Assisted Reading and Analysis of Text (CARAT) methodology. The classification and annotation strategies manage to regroup systematically segments of text that present some content regularity. This allows the interpreter to focus directly on the organized content of the concept under study. It helps reveal its various dimensions (definitions, illustrations, explanations, inferences, etc.). Still, this research is ongoing. More linguistic transformations should be applied so as to find synonymic expressions of a concept. Also, various types of summarization, extraction and formal representation of the regularities of each class are to be explored in the future. But the results obtained so far reinstate the pertinence of the concordance as a tool for conceptual analysis. But it situates it in a mathematical surrounding that aim at unveiling the various dimensions of a conceptual structure. Most of all, we believe that this methodology may possibly interest expert readers and analysis for it gives a strong handle and control on their interpretation process although assisting them throughout the process. References Brandom, R.B. (1994). Making it Explicit. Cambridge: Harvard University Press. Desclés, Jean-Pierre (1997) “Schèmes, notions, predicats et termes”. Logique, discours et pensée, Mélanges offerts à JeanBlaize Grize, Peter Lang, 9-36. 47. Djioua B. and Desclés, J.P. (2007), “Indexing Documents by Discourse and Semantic Contents from Automatic Annotations of Texts”, FLAIRS 2007, Special Track “Automatic Annotation and Information Retrieval : New Perspectives”, Key West, Florida, May 9-11. Fodor, J. (1998) Concepts:Where Cognitive Science Went Wrong. Oxford: OUP. Gardenfors, P. (2000) Conceptual Spaces. Cambridge (Mass.): MIT Press. Jain, et al. (1999). Data Clustering: A Review. ACM Computing Surveys, 31(3):264–323. Manning, C. and Schutze, H. (1999) Foundations of Statistical Natural Language Processing, Cambridge Mass. : MIT Press. Meyers, Adam (2005) Introduction to Frontiers in CorpusAnnotation II Pie in the Sky Proceedings of the Workshop on Frontiers in Corpus Annotation II: Pie in the Sky, pages 1–4, New York University Ann Arbor, June 2005. _____________________________________________________________________________ 165 Digital Humanities 2008 _____________________________________________________________________________ McCarthy, W. (2004) Humanities Computing, Palgrave MacMillan Blackwell Publishers. Palmer, M., Kingsbury, P., Gildea, D. (2005) “The Proposition Bank: An Annotated Corpus of Semantic Roles”. Computational Linguistics, vol. 31, no 1, pp. 71-106. Peirce, C.S. (1931-1935, 1958), Collected Papers of Charles Sanders Peirce, vols. 1–6, Charles Hartshorne and Paul Weiss (eds.), vols. 
7–8, Arthur W. Burks (ed.), Harvard University Press, Cambridge, MA, 1931–1935, 1958. Pincemin, B. et al. (2006). Concordanciers: thème et variations, in J.-M. Viprey (éd.), Proc. of JADT 2006, pp. 773-784. Rastier, F. (2005) Pour une sémantique des textes théoriques. Revue de sémantique et de pragmatique, 17, 2005, pp. 151-180. Rastier, F. et al. (eds) (1995) L'analyse thématique des données textuelles: l'exemple des sentiments. Paris: Didier Érudition. Rockwell, G. (2003) What is text analysis, really? Literary and Linguistic Computing, 18(2): 209-219. Topic Maps and Entity Authority Records: an Effective Cyber Infrastructure for Digital Humanities Jamie Norrish Jamie.Norrish@vuw.ac.nz New Zealand Electronic Text Centre, New Zealand Alison Stevenson alison.stevenson@vuw.ac.nz New Zealand Electronic Text Centre, New Zealand The implicit connections and cross-references between and within texts, which occur in all print collections, can be made explicit in a collection of electronic texts. Correctly encoded and exposed they create a framework to support resource discovery and navigation by following links between topics. This framework provides opportunities to visualise dense points of interconnection and, deployed across otherwise separate collections, can reveal unforeseen networks and associations. Thus approached, the creation and online delivery of digital texts moves from a digital library model with its goal as the provision of access, to a digital humanities model directed towards the innovative use of information technologies to derive new knowledge from our cultural inheritance. Using this approach the New Zealand Electronic Text Centre (NZETC) has developed a delivery system for its collection of over 2500 New Zealand and Pacific Island texts using TEI XML, the ISO Topic Map technology1 and innovative entity authority management. Like a simple back-of-book index but on a much grander scale, a topic map aggregates information to provide binding points from which everything that is known about a given subject can be reached. The ontology which structures the relationships between different types of topics is based on the CIDOC Conceptual Reference Model2 and can therefore accommodate a wide range of types. To date the NZETC Topic Map has included only those topics and relationships which are simple, verifiable and object based. Topics currently represent authors and publishers, texts and images, as well as people and places mentioned or depicted in those texts and images. This has proved successful in presenting the collection as a resource for research, but work is now underway to expand the structured mark-up embedded in texts to encode scholarly thinking about a set of resources. Topic-based navigable linkages between texts will include 'allusions' and 'influence' (both of one text upon another and of an abstract idea upon a corpus, text, or fragment of text).3 Importantly, the topic map extends beyond the NZETC collection to incorporate relevant external resources which expose structured metadata about entities in their collection (see Figure 1). _____________________________________________________________________________ 166 Digital Humanities 2008 _____________________________________________________________________________ Cross-collection linkages are particularly valuable where they reveal interdisciplinary connections which can provide fertile ground for analysis.
For example the National Library of New Zealand hosts a full text archive of the Transactions and Proceedings of the Royal Society containing New Zealand science writing 1868-1961. By linking people topics in the NZETC collection to articles authored in the Royal Society collection it is possible to discern an interesting overlap between the 19th century community of New Zealand Pakeha artists and early colonial geologists and botanists. In order to achieve this interlinking, between collections, and across institutional and disciplinary boundaries, every topic must be uniquely and correctly identied. In a large, full text collection the same name may refer to multiple entities,4 while a single entity may be known by many names.5 When working across collections it is necessary to be able to condently identify an individual in a variety of contexts. Authority control is consequently of the utmost importance in preventing confusion and chaos. The library world has of course long worked with authority control systems, but the model underlying most such systems is inadequate for a digital world. Often the identier for an entity is neither persistent nor unique, and a single name or form of a name is unnecessarily privileged (indeed, stands in as the entity itself). In order to accommodate our goals for the site, the NZETC created the Entity Authority Tool Set (EATS),6 an authority control system that provides unique, persistent, sharable7 identiers for any sort of entity. The system has two particular benets in regards to the needs of digital humanities researchers for what the ACLS described as a robust cyber infrastructure.8 Firstly, EATS enables automatic processing of names within textual material.When dealing with a large collection, resource constraints typically do not permit manual processing -for example, marking up every name with a pointer to the correct record in the authority list, or simply recognising text strings as names to begin with. To make this process at least semi-automated, EATS stores names broken down (as much as possible) into component parts. By keeping track of language and script information associated with the names, the system is able to use multiple sets of rules to know how to properly glue these parts together into valid name forms. So, for example, William Herbert Ellery Gilbert might be referred to in a text by “William Gilbert”, “W. H. E. Gilbert”, “Gilbert,Wm.”, or a number of other forms; all of these can be automatically recognised due to the language and script rules associated with the system. Similarly Chiang Kai-shek, being a Chinese name, should be presented with the family name first, and, when written in Chinese script, without a space between the name parts (蒋介石). Figure 1: A mention of Samuel Marsden in a given text is linked to a topic page for Marsden which in turn provides links to other texts which mention him, external resources about him and to the full text of works that he has authored both in the NZETC collection and in other online collections entirely separate from the NZETC. 
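Returning to the name-form matching just described: purely by way of illustration (EATS's internal representation is not given here, and the rules and abbreviation table below are illustrative assumptions rather than the system's code), the idea of generating candidate surface forms from stored name components might be sketched as follows:

    # Not the EATS implementation: a minimal sketch in which a name is stored as
    # component parts plus language/script information, and plausible surface forms
    # are generated from those parts for matching against strings in a text.
    ABBREVIATIONS = {"William": ["Wm."]}   # hypothetical lookup of non-obvious short forms

    def western_name_forms(given_names, family_name):
        """Generate candidate surface forms for a Western-style name."""
        forms = {f"{given_names[0]} {family_name}"}                      # "William Gilbert"
        initials = " ".join(g[0] + "." for g in given_names)
        forms.add(f"{initials} {family_name}")                           # "W. H. E. Gilbert"
        forms.add(f"{family_name}, {given_names[0]}")                    # "Gilbert, William"
        for short in ABBREVIATIONS.get(given_names[0], []):
            forms.add(f"{family_name}, {short}")                         # "Gilbert, Wm."
        return forms

    def chinese_name_forms(family_name, given_name, script_forms=()):
        """Family name first; no space between parts when written in Chinese script."""
        forms = {f"{family_name} {given_name}"}                          # "Chiang Kai-shek"
        forms.update("".join(parts) for parts in script_forms)           # e.g. "蒋介石"
        return forms

    if __name__ == "__main__":
        print(western_name_forms(["William", "Herbert", "Ellery"], "Gilbert"))
        print(chinese_name_forms("Chiang", "Kai-shek", script_forms=[("蒋", "介", "石")]))

The point of the sketch is only that language- and script-specific rules, applied to component parts, yield the set of forms against which strings in a text can be matched.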
The ability to identify entities within plain text and add structured, machine-readable mark-up contributes to the growth of electronic text corpora suitable for the types of computational analysis oered by projects such as the MONK environment.9 This is, however, distinct from the problem of identifying substrings within a text that might be names, but that are not found within EATS.This problem, though signicant, does not fall within the main scope of the EATS system.10 Similarly, disambiguating multiple matches for the same name is generally best left to the determination of a human being: even date matches are too often problematic.11 Secondly, the system is built around the need to allow for an entity to carry sometimes conflicting, or merely dierent, information from multiple sources, and to reference those sources.12 Having information from multiple sources aids in the process of disambiguating entities with the same names; just as important is being able to link out to other relevant resources. For example, our topic page for William Colenso links not only to works in the NZETC collection, but also to works in other collections, where the information on those other collections is part of the EATS record. It is, however, barely sucient to link in this way directly from one project to another. EATS, being a web application, can itself be exposed to the net and act as a central hub for information and resources pertaining to the entities within the system. Since all properties of an entity are made as assertions by an organisation, EATS allows multiple such organisations to use and modify records without touching anyone else’s data; adding data harvesting to the mix allows for centralisation of information (and, particularly, pointers to further information) without requiring much organisational centralisation. _____________________________________________________________________________ 167 Digital Humanities 2008 _____________________________________________________________________________ One benefit of this approach is handling entities about which there is substantial dierence of view. With topics derived from research (such as ideas and events) there are likely to be dierences of opinion as to both the identification of entities and the relationships between them. For example one organisation may see one event where another sees two. To be able to model this as three entities, with relationships between them asserted by the organisations, a potentially confusing situation becomes clear, without any group having to give up its own view of the world. The EATS system can achieve this because all information about an entity is in the form of a property assertion made by a particular authority in a particular record (see figure 2). The technologies developed and deployed by the NZETC including EATS are all based on open standards. The tools and frameworks that have been created are designed to provide durable resources to meet the needs of the academic and wider community in that they promote interlinking between digital collections and projects and are themselves interoperable with other standards-based programs and applications including web-based references tools, eResearch virtual spaces and institutional repositories. Notes 1 For further information see Conal Tuhoy’s \Topic Maps and TEI | Using Topic Maps as a Tool for Presenting TEI Documents” (2006) http://hdl.handle.net/10063/160. 
2 The CIDOC CRM is an ISO standard (ISO 21127:2006) which provides denitions and a formal structure for describing the implicit and explicit concepts and relationships used in cultural heritage documentation. For more information see http://cidoc.ics.forth.gr/. 3 The initial project work is being undertaken in collaboration with Dr Brian Opie from Victoria University of Wellington and is centred around the work and influences of William Golder, the author of the first volume of poetry printed and published in New Zealand. 4 For example, the name Te Heuheu is used in a number of texts to refer to multiple people who have it as part of their full name. 5 For example the author Iris Guiver Wilkinson wrote under the pseudonym Robin Hyde. 6 For more analysis of the weakness of current Library standards for authority control and for more detail information on EATS see Jamie Norrish’s \EATS: an entity authority tool set” (2007) at http://www. nzetc.org/downloads/eats.pdf. 7 Being a web-based application the identiers are also dereferencable (ie resolve a web resource about the entity) and therefore can be used as a resource by any web project. 8 “Our Cultural Commonwealth: The final report of the American Council of Learned Societies Commission on Cyberinfrastructure for the Humanities & Social Sciences” (2006) http://www.acls.org/ cyberinfrastructure/OurCulturalCommonwealth.pdf. 9 Metadata Oer New Knowledge http://www.monkproject.org/. 10 EATS can be provided with name data on which to make various judgements (such as non-obvious abbreviations like Wm for William), and it would be trivial to get a list of individual parts of names from the system, for identication purposes, but there is no code for actually performing this process. 11 That said, the NZETC has had some success with automated ltering of duplicate matches into dierent categories, based on name and date similarity (not equality); publication dates provide a useful cut-o point for matching. 12 A source may be a primary text or an institution’s authority record. Figure 2:The EATS objects and basic relationships Only once both cultural heritage institutions and digital humanities projects adopt suitable entity identiers and participate in a shared mapping system such as EATS, can there exist unambiguous discovery of both individual resources and connections between them. The wider adoption of this type of entity authority system will contribute substantially to the creation of the robust cyber infrastructure that will, in the words of the ACLS “allow digital scholarship to be cumulative, collaborative, and synergistic.” _____________________________________________________________________________ 168 Digital Humanities 2008 _____________________________________________________________________________ 2D and 3D Visualization of Stance in Popular Fiction Lisa Lena Opas-Hänninen lisa.lena.opas-hanninen@oulu.fi University of Oulu, Finland Tapio Seppänen tapio@ee.oulu.fi University of Oulu, Finland Mari Karsikas mmantysa@mail.student.oulu.fi University of Oulu, Finland Suvi Tiinanen stiinane@mail.student.oulu.fi University of Oulu, Finland Introduction In analyzing literary style through statistical methods, we often show our results by plotting graphs of various types, e.g. scatterplots, histograms, and dendrograms. While these help us to understand our results, they always necessarily “flatten” the data into a 2- dimensional format. 
For the purposes of visualization of complex data, we might be better off trying to look at it three dimensionally if a suitable vector representation of the data is found first.This paper investigates stance in three groups of popular fiction, namely romance fiction, detective fiction written by female authors and detective fiction written by male authors, and presents the results using both 2 dimensional and three dimensional visualization. Romance fiction and detective novels are both characterized by the fact that they all have the same basic story. In romance fiction it is the story of how the hero and heroine meet, how their relationship develops and how the happy ending is achieved. In detective fiction it is the story of the murder, the unraveling of the case, the misleading clues and the final solution. The same story is being retold over and over again, just as in the oral storytelling tradition (Radway 1984:198).The reader is not left in suspense of the final outcome and each story is different from the others in the details of the events and the way the story is told.Through this the reader becomes involved in the story and partakes in the emotions, attitudes and thoughts of the protagonist. These feelings, emotions and moods are marked by syntactic and semantic features often referred to as markers of stance. Stance refers to the expression of attitude and consists of two different types of expressions of attitude: evidentiality and affect (Biber and Finegan 1989). Evidentiality means that the reader becomes privy to the speaker’s attitudes towards whatever knowledge the speaker has, the reliability of that knowledge and how the speaker came about that knowledge. Affect refers to the personal attitudes of the speaker, i.e. his/ her emotions, feelings, moods, etc. Biber and Finegan (1989) investigated 12 different categories of features deemed to mark stance: certainty/doubt adverbs, certainty/doubt verbs, certainty/doubt adjectives, affective expressions, hedges, emphatics and necessity/possibility/predictive modals. They showed that different text types are likely to express stance in different ways. Opas and Tweedie (1999) studied stance in romance fiction and showed that three types of romance fiction can be separated by their expression of stance. This paper continues these studies, paying special attention to the visualization of the results. Materials and methods Our total corpus is 760 000 words. It consists of three parts: romance fiction, female-authored detective fiction and maleauthored detective fiction, all published in the 1990s. The romance fiction comprises a total of 240 000 words from the Harlequin Presents series, the Regency Romance series and Danielle Steel’s works, which are often classified as women’s fiction or ’cross-overs’. The female-authored detective fiction part of our corpus includes Patricia Cornwell, Sue Grafton, P.D. James, Donna Leon, Ellis Peters, and Ruth Rendell. These texts make up 295 000 words.The rest of the corpus (229 000 words) is made up of male-authored detective fiction texts, including Colin Dexter, Michael Dibdin, Quintin Jardine, Ian Rankin and Peter Tremayne. Principal components analysis was used to reduce the 12 markers of stance to three dimensions which describe 52.4% of the variation in the data. Results In a previous study Opas and Tweedie (2000) concluded that the detective stories seem to mark evidentiality, i.e. the characters express their certainty and doubt. 
The romance stories, on the other hand, seem to mainly mark affect, i.e. the characters express their emotions and moods. In Biber and Finegan's terms (1989), the detective fiction texts show 'interactional evidentiality' and the romance fiction texts show 'emphatic expression of affect'. The results in Figures 1 and 2 below show the female-authored detective stories in shades of red and yellow, the male-authored detective stories in shades of blue and the romance stories in shades of black. While in Figure 1 the female-authored texts are perhaps slightly more clearly separable from the others, it still seems that all these texts are far more overlapping and less clearly distinguishable as three groups than the texts in previous studies were. However, broadly speaking, the romance texts are in the lower half of the graph, the female-authored detective texts in the upper half and the male-authored detective texts in the middle. What is surprising, though, is that no features seem to be "pulling" texts downwards, towards the lower half of the graph, and that the features that "pull" the texts upwards include both markers of certainty/doubt and affect.
_____________________________________________________________________________ 169 Digital Humanities 2008 _____________________________________________________________________________
References
Biber, D. and E. Finegan (1989) Styles of Stance in English: Lexical and Grammatical Marking of Evidentiality and Affect. Text 9.1: 93-124.
Opas, L.L. and F.J. Tweedie (1999) The Magic Carpet Ride: Reader Involvement in Romantic Fiction. Literary and Linguistic Computing 14.1: 89-101.
Figure 1. PC1 and PC2 for detective and romance fiction
Opas, L.L. and F.J. Tweedie (2000) Come into My World: Styles of stance in detective and romantic fiction. Poster. ALLC/ACH 2000. Glasgow, UK.
Radway, J.A. (1984) Reading the Romance. Chapel Hill: University of North Carolina Press.
Figure 2. PC2 and PC3 for detective and romance fiction
Figure 2 seems to show similar results. Here, perhaps, the female-authored detective texts are even slightly more easily separated from the others, but the general impression of the male-authored texts and the romance texts overlapping remains. Yet again it seems that there are hardly any features accounting for the texts on the left-hand side of the graph and that the features "pulling" the texts to the right include both features marking evidentiality and those marking affect. These results are quite surprising. Either male-authored detective stories mark evidentiality and affect in the same manner as romance fiction (and here it seems that they don't show many features that would mark stance), or there is something more complex at work here. To help us understand these phenomena better, we would suggest visualizing the workings of the markers of stance in a 3-dimensional model. To this end, we have built a tool that takes the principal components analysis data, reduces dimensionality down to the three components with the most energy and presents the data with these components. The software tool is implemented in the MATLAB® environment (The MathWorks, Inc., Massachusetts, USA) utilizing its 3D graphical functions. The tool is an interactive one, allowing the researcher to turn the 3D model and look for the angles that best show the clustering structure and the differences between the texts. We will demonstrate how tools such as this one significantly improve the researcher's ability to visualize the research results and to interpret them.
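The tool itself is written in MATLAB; purely as an illustration of the kind of reduction and display described above, a rough Python analogue might look like the following (the stance-marker matrix here is a random placeholder, not the paper's data):

    # Illustrative only: assumes a (texts x 12 stance-marker) count matrix.
    import numpy as np
    import matplotlib.pyplot as plt

    def top_three_components(X):
        """Project rows of X onto the three principal components carrying the most variance."""
        Xc = X - X.mean(axis=0)                       # centre each stance-marker column
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        scores = Xc @ Vt[:3].T                        # coordinates on PC1-PC3
        explained = (S ** 2 / (S ** 2).sum())[:3]     # proportion of variance per component
        return scores, explained

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        X = rng.poisson(5.0, size=(60, 12)).astype(float)   # 60 placeholder texts, 12 stance markers
        groups = np.repeat([0, 1, 2], 20)                   # romance / female-authored / male-authored
        scores, explained = top_three_components(X)

        fig = plt.figure()
        ax = fig.add_subplot(111, projection="3d")
        for g, colour, label in [(0, "black", "romance"),
                                 (1, "red", "female-authored detective"),
                                 (2, "blue", "male-authored detective")]:
            sel = groups == g
            ax.scatter(scores[sel, 0], scores[sel, 1], scores[sel, 2], c=colour, label=label)
        ax.set_xlabel("PC1"); ax.set_ylabel("PC2"); ax.set_zlabel("PC3")
        ax.legend()
        plt.show()   # the plot can be rotated interactively, much as described for the MATLAB tool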
_____________________________________________________________________________ 170 Digital Humanities 2008 _____________________________________________________________________________ TEI Analytics: a TEI Format for Cross-collection Text Analysis Stephen Ramsay sramsay@unlserve.unl.edu University of Nebraska-Lincoln, USA Brian Pytlik-Zillig bpytlikz@unlnotes.unl.edu University of Nebraska-Lincoln, USA The Monk Project (http://www.monkproject.org/) has for the last year been developing a Web-based system for undertaking text analysis and visualization with large, full-text literary archives. The primary motivation behind the development of this system has been the profusion of sites like the Brown Women Writers Project, Perseus Digital Library, Wright American Fiction, Early American Fiction, and Documenting the American South, along with the vast literary text corpora oered by organizations like the Text Creation Partnership and Chadwyck-Healey. Every one of these collections represents fertile ground for text analysis. But if they could be somehow combined, they would constitute something considerably more powerful: a literary full text-corpus so immense as to be larger than anything that has yet been created in the history of computing. The obstacles standing in the way of such a corpus are well known. While all of the collections mentioned above are encoded in XML and most of them are TEI-conformant, local variations can be so profound as to prohibit anything but the most rudimentary form of cross-collection searching. Tagset inclusions (and exclusions), local extensions, and local tagging and metadata conventions dier so widely among archives, that it is nearly impossible to design a generalized system that can cross boundaries without a prohibitively cumbersome set of heuristics. Even with XML and TEI, not all texts are created equal. TEI Analytics, a subset of TEI in which varying text collections can be expressed, grew out of our desire to make MONK work with extremely large literary text corpora of the sort that would allow computational study of literary and linguistic change over broad periods of time. TEI Analytics Local text collections vary not because archive maintainers are contemptuous toward standards or interoperability, but because particular local circumstances demand customization. The nature of the texts themselves may require specialization, or something about the storage, delivery, or rendering framework used may favor particular tags or particular structures. Local environments also require particular metadata conventions (even within the boundaries of the TEI header). This is in part why the TEI Consortium provides a number of pre-fabricated customizations, such as TEI Math and TEI Lite, as well as modules for Drama, transcriptions of speech, and descriptions of manuscripts. Roma (the successor to Pizza Chef) similarly allows one to create a TEI subset, which in turn may be extended for local circumstances. TEI Analytics, which is itself a superset of TEI Tite, is designed with a slightly dierent purpose in mind. If one were creating a new literary text corpus for the purpose of undertaking text analytical work, it might make the most sense to begin with one of these customizations (using, perhaps, TEI Corpus). In the case of MONK, however, we are beginning with collections that have already been tagged using some version of TEI with local extensions. 
TEI Analytics is therefore designed to exploit common denominators in these texts while at the same time adding new structures for common analytical data structures (like part-of-speech tags, lemmatizations, named-entities, tokens, and sentence markers). The idea is to create a P5compliant format that is designed not for rendering, but for analytical operations such as data mining, principle component analysis, word frequency study, and n-gram analysis. In the particular case of MONK, such documents have a relatively brief lifespan; once documents are converted, they are read in by a system that stores the information using a combination of object-relational database technology and binary indexing. But before that can happen, the texts themselves need to be analyzed and re-expressed in the new format. Implementation Our basic approach to the problem involves schema harvesting. The TEI Consortium’s Roma tool (http://tei.oucs.ox.ac.uk/ Roma/) was first used to create a base W3C XML schema for TEI P5 documents, which we then extended using a custom ODD file. With this basis in place, we were able to create an XSLT “meta-stylesheet” (MonkMetaStylesheet.xsl) that consults the target collection’s W3C XML schema to determine the form into which the TEI P4 les should be converted. This initial XSLT stylesheet is a meta-stylesheet in the sense that it programatically authors another XSLT stylesheet. This second stylesheet (XMLtoMonkXML.xsl), which is usually thousands of lines long, contains the conversion instructions to get from P4 to the TEI Analytics’s custom P5 implementation. Elements that are not needed for analysis are removed or re-named according to the requirements of MONK (for example, numbered <div>s are replaced with un-numbered <div>s). Bibliographical information is critical for text analysis, and both copyright and responsibility information must be maintained, but much of the information contained in the average <teiHeader> (like revision histories and records of workflow) are not relevant to the task. For this reason, TEI Analytics uses a radically simplied form of the TEI header. 
_____________________________________________________________________________ 171 Digital Humanities 2008 _____________________________________________________________________________
Here is a sample template in the meta-stylesheet (in a somewhat abbreviated form):

<xsl:template match="xs:element[@name=$listOfAllowableElements/*]">
  <!-- elements are identified in the MONK schema, and narrowed to list of allowable elements -->
  <xsl:element name="xsl:template">
    <!-- begins writing of 'xsl:template' elements in the final XMLtoMonkXML.xsl stylesheet -->
    <xsl:attribute name="match">
      <!-- begins writing of the (approximately 122 unique) match attributes on the 'xsl:template' elements -->
      <xsl:choose>
        <xsl:when test="$attributeName = $attributeNameLowercase">
          <xsl:value-of select="@name"/>
        </xsl:when>
        <xsl:otherwise>
          <xsl:value-of select="concat(@name,' | ',lower-case(@name))"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:attribute>
    <!-- ends writing of the match attributes on the 'xsl:template' elements -->
    <xsl:element name="{$attributeName}">
      <!-- writes the unique contents of each 'xsl:template' element in the XMLtoMonkXML.xsl stylesheet -->
      <xsl:for-each select="$associatedAttributeList/list">
        <xsl:choose>
          <xsl:for-each select="child::item[string-length(.) > 0]">
            <!-- all strings (in the dynamically-generated list of associated attributes) greater than zero are processed -->
            <xsl:when>
              <xsl:attribute name="test">
                <xsl:value-of select="concat('@',.)"/>
              </xsl:attribute>
              <xsl:copy-of>
                <xsl:attribute name="select">
                  <xsl:value-of select="concat('@',.)"/>
                  <!-- copies the element's attributes, constrained to a list of attributes desired by MONK -->
                </xsl:attribute>
              </xsl:copy-of>
            </xsl:when>
          </xsl:for-each>
          <xsl:otherwise>
          </xsl:otherwise>
          <!-- any zero-length strings (in the dynamically-generated list of associated attributes) are discarded -->
        </xsl:choose>
      </xsl:for-each>
      <xsl:apply-templates/>
    </xsl:element>
    <!-- ends writing of 'xsl:template' elements in the final XMLtoMonkXML.xsl stylesheet -->
  </xsl:element>
</xsl:template>

All processes are initiated by a program (called Abbot) that performs, in order, the following tasks:
1. Generates the XMLtoMonkXML.xsl stylesheet.
2. Edits the XMLtoMonkXML.xsl stylesheet to add the proper schema declarations in the output files.
3. Converts the entire P4 collection to MONK's custom P5 implementation.
4. Removes any stray namespace declarations from the output files, and
5. Parses the converted files against the MONK XML schema.
These steps are expressed in BPEL (Business Process Execution Language), and all source files are retained in the processing sequence so that the process can be tuned, adjusted, and rerun as needed without data loss. The main conversion process takes, depending on the hardware, approximately 30 minutes for roughly 1,000 novels and yields files that are then analyzed and tagged using Morphadorner (a morphological tagger developed by Phil Burns, a member of the MONK Project at _____________________________________________________________________________ 172 Digital Humanities 2008 _____________________________________________________________________________ Northwestern University). Plans are underway for a plugin architecture that will allow one to use any of the popular taggers (such as GATE or OpenNLP) during the analysis stage.
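For readers who want a concrete picture of the overall run (not the Abbot/MONK code itself, which drives these steps with BPEL), a minimal Python sketch of the same sequence using lxml might look like this; all file and directory names are hypothetical:

    # Illustrative sketch only: generate the conversion stylesheet from the
    # meta-stylesheet and the collection's schema, convert each P4 file, and
    # validate the output against the target schema.
    from pathlib import Path
    from lxml import etree

    def build_converter(meta_stylesheet="MonkMetaStylesheet.xsl", collection_schema="collection.xsd"):
        """Apply the meta-stylesheet to the collection's W3C schema to obtain XMLtoMonkXML.xsl."""
        meta = etree.XSLT(etree.parse(meta_stylesheet))
        generated = meta(etree.parse(collection_schema))       # the generated conversion stylesheet
        return etree.XSLT(generated)

    def convert_collection(src_dir="p4-texts", out_dir="p5-texts", target_schema="target-p5.xsd"):
        convert = build_converter()
        schema = etree.XMLSchema(etree.parse(target_schema))
        Path(out_dir).mkdir(exist_ok=True)
        for p4_file in Path(src_dir).glob("*.xml"):
            result = convert(etree.parse(str(p4_file)))        # P4 source -> custom P5 output
            result.write(str(Path(out_dir) / p4_file.name), xml_declaration=True, encoding="utf-8")
            if not schema.validate(result):                    # parse the converted file against the schema
                print(p4_file.name, schema.error_log.last_error)

    if __name__ == "__main__":
        convert_collection()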
Conclusion
We believe that TEI Analytics performs a useful niche function within the larger ecology of TEI by making disparate texts usable within a single text analysis framework. Even without the need for ingestion into a larger framework, TEI Analytics facilitates text analysis of disparate source files simply by creating a consistent and unified XML representation. We also believe that our particular approach to the problem of XML conversion (a small stylesheet capable of generating massive stylesheets through schema harvesting) may be useful in other contexts, including, perhaps, the need to convert texts from P4 to P5.
The Dictionary of Words in the Wild
Willard McCarty willard.mccarty@kcl.ac.uk King's College London, UK
Geoffrey Rockwell georock@mcmaster.ca McMaster University
Eleni Pantou-Kikkou King's College London, UK
Introduction
The Dictionary of Words in the Wild [1] is an experiment in social textuality and the perceptual dynamics of reading. The Dictionary is a social image site where contributors can upload pictures of words taken "in the wild" and tag them so they are organized alphabetically as an online visual dictionary. Currently the Dictionary has 2227 images of 3198 unique words and 24 contributor accounts. The images uploaded and tagged are of text, usually single words or phrases, that appear in the everyday environment. Images uploaded include pictures of signs, body tattoos, garbage, posters, graffiti, labels, church displays, gravestones, plastic bags, clothing, art, and other sights. The site is structured with an application programming interface to encourage unanticipated uses. In this paper we will:
• Give a tour through the online social site and its API,
• Discuss current scholarship on public textuality and the perceptual dynamics of reading,
• Reflect on the Dictionary in these contexts, and
• Conclude with speculations on its possible contributions to our understanding of textuality and reading.
Outline of the Dictionary
The following is a narrative of the demonstration part of the presentation. The Dictionary was developed with Ruby on Rails by Andrew MacDonald with support from the TAPoR project under the direction of Geoffrey Rockwell. Users can get a free account in order to start uploading images. (We discovered once the project was visible for a few months that spambots were automatically creating accounts, so we have introduced a CAPTCHA-like graphical challenge-response feature to weed out false accounts.)
_____________________________________________________________________________ 173 Digital Humanities 2008 _____________________________________________________________________________
When you upload a picture you are given the opportunity to crop it and are prompted to provide a list of words that appear in the text. You can also provide a brief description or discussion of the word image. Once uploaded, the image is filed in a database and the interface allows you to access the images in different ways:
• You can click on a letter and browse images with words starting with that letter. An image with "On" and "Off" will be filed twice, though at present the label is the first word tagged.
• You can search for a word and see the images that have been tagged with that word.
• You can type in a phrase and get a sequence of images.
• You can see what comments have been left by others on your images and respond to them.
The API allows the user to create other processes or text toys that can query the dictionary and get back an XML file with the URLs for word images like: <phrase> <word href=”href_for_image”>Word</word> <word>Word_with_no_image</word> </phrase> One of the goals of the project is to support mashups that use the dictionary for new projects. This is where the Dictionary is different than other such projects, many of which use Flickr to pull images of letters or words. Theoretical Discussion The Dictionary is an exploratory project designed to encourage the gathering of images of words in the wild and to provoke thinking about our encounters with these words. It did not start with an articulated theory of text that it set out to illustrate, nor in the contributors’ experience does the actual collecting of words tend to be governed by this or that theory. Rather, in its simplicity, the Dictionary encourages participants to document public textuality as they encounter and perceive it. At least in the initial phase of the project, the designers have imposed no rules or guidelines for the collecting, editing and tagging of words. Although it is clear that certain requirements for entering words into the Dictionary would make subsequent research questions easier to pursue, the designers prefer not to impose them so as to discover what in fact participants find to be interesting and practical to record. Because recording of words is voluntary and would seem inevitably to be limited to a few individuals, the time and effort required must be kept to a minimum in order to have a collection sufficiently large to allow the research potential of the Dictionary to emerge. The Dictionary is meant to provoke reflection on the actual verbal environment in its totality, on the moment-by-moment encounter with individual words and phrases where one finds them and on the experience of reading them as each reading unfolds. Collecting of verbal images for the Dictionary presents collectors with a way of defamiliarizing familiar environments. Conventional techniques for framing a photograph and digital tools for cropping it give them limited but surprisingly powerful means of recording defamiliarized sights. Additional means are provided by a commentary feature, but the amount of time required to compose this commentary tends to discourage much of it.Theoretical reflection on the encounter with words in the wild would seem to require participation in the project to be adequate to the data. For this reason collecting is not a task to be done separately from theorizing. Whatever theory is to emerge will come from participant observation. In some cases, we have found, what appears interesting to record is relatively static, requiring little compositional care or subsequent cropping. In many cases, however, the experience of reading is a dynamic event, as when part of a verbal message, not necessarily first in normal reading order, is perceived first, then submerged into the overall syntax of its verbal and/or visual environment. In other cases, the experience may include a specific detail of the environment in which the text of interest is embedded but not be informed significantly by other details. Occasionally one has sufficient time to frame a photograph to capture the experience adequately, but most often photographs must be taken quickly, as when unwelcome attention would be drawn to the act of photography or the situation otherwise advises stealth (an inscription on a t-shirt, for example). 
The Dictionary, one might say, is a record of psycholinguistic events as much as, or more than, simply of environmental data. In more theoretical terms, the project aims to study how language acts as a semiotic system materially placed in the real world. In order to interpret this multidimensional, "semiotic" role of language, our analysis focuses on how dictionary users perceive different signs and attribute meanings to words by referring to these signs. We will argue that through this kind of visual dictionary contributors can interact and play with language by using visual artifacts (photos, images, graffiti etc.) to express and define the meanings of words. We have strong theoretical reasons for regarding text as co-created by the reader in interaction with the verbal signs of the document being read. The Dictionary gives the reader of words in the wild a means of implementing the act of reading as co-creative, but with a significant difference from those acts that have previously been the focus of theoretical work. The collector of wild words, like the reader of literature, is obviously not so much a viewer as a producer of meaning, but unlike the literary reader, the collector is operating in a textual field whose real-world context is highly unlikely ever to be otherwise recorded. It so obviously goes without saying that it also goes by and vanishes without ever being studied.
_____________________________________________________________________________ 174 Digital Humanities 2008 _____________________________________________________________________________
The title of the project suggests a distinction between text in two different places: the kind at home and the kind in the wild. But the collector's gaze rapidly suggests that the distinction is only in part one of place. Text at home can also be 'wild' if it can be defamiliarized, e.g. the title of a book on its spine taken in poor light conditions inside a house. The wildness of words is then 'in the eye of the beholder', though the domestic environment is usually so well regulated that opportunities for perceiving wildness are far more limited than in relatively undomesticated environments. Thus such opportunities tend to occur far less frequently in well-kept or wealthy neighbourhoods than in poorly kept ones, where rubbish is frequently encountered and advertising of all kinds is evident. This suggests, ironically, that poorer neighbourhoods are, in respect of the sheer amount of reading, more rather than less literate. But the correlation between wealth and verbosity is not so straightforward. Airport lounges, for example, are rich in examples of wild words. What would seem to matter in this correlation is the acquisitive desire of the population: those who have tend not to read for more. Such theorizing, however, is clearly at a quite preliminary stage. The project calls for a more systematic ethnography of textuality and its everyday occurrence. Insofar as it can be conceptualized, the long-term goal of the project can be put as a question: would it be possible to develop a panoptic topology of the appearance of the legible in everyday life, if even just for one person?
Similar Work
The Dictionary is one of a number of projects that use the internet to share images of textuality. For example, Typography Kicks Ass: Flickr Bold Italic [2] is a Flash toy that displays messages left by people using letters from Flickr. The London Evening Standard Headline Generator [3] from thesurrealist.co.uk generates headlines from a Flickr set of images of headlines. IllegalSigns.ca [4] tracks illegal billboards in Toronto and has a Clickable Illegal Signs Map [5] that uses Google Maps. On Flickr one can find sets like Its Only Words [6] of images of texts. What all these projects have in common is the photographic gaze that captures words or phrases in a context, whether for aesthetic or advocacy purposes. The Dictionary is no different: it is meant to provoke reflection on the wild context of text as it is encountered on the street.
The Future
The success of the project lies in how the participants push the simple assumptions encoded in the structure. The project would have failed had no one contributed, but with contributions come exceptions to every design choice. The types of text contributors want to collect and formally tag have led to the specification of a series of improvements that are being implemented with the support of TAPoR and SSHRC. The paper will conclude with some images and the future directions they have provoked:
• We need to parse phrases so that we remove punctuation. For example, "faith," won't find the image for "faith".
• We need to allow implicit words to be entered with parentheses where the word doesn't appear, but is implicit. An example would be http://tapor1-dev.mcmaster.ca/~dictwordwild/show/694 which is filed under "Average" even though the word doesn't appear.
• We need to allow short phrasal verbs and compounds to be entered with quotation marks so they are filed as one item. An example would be "come up" or "happy days".
• We need to allow images of longer passages to be identified as "Sentences in the Sticks", "Phrases in the Fields" or "Paragraphs in the Pastures". These would not be filed under individual words, but the full text could be searched.
• We need to allow people to control capitalization so that, for example, "ER" (which stands for "Emergency Room") is not rendered as "Er".
• We need to let people add tags that are not words so images can be sorted according to categories like "Graffiti" or "Billboard".
Links
1. Dictionary of Words in the Wild: http://tapor1-dev.mcmaster.ca/~dictwordwild/
2. Typography Kicks Ass: http://www.typographykicksass.com/
3. The London Evening Standard Headline Generator: http://thesurrealist.co.uk/standard.php
4. IllegalSigns.ca: http://illegalsigns.ca/
5. Clickable Illegal Signs Map: http://illegalsigns.ca/?page_id=9
6. Its Only Words: http://www.flickr.com/photos/red_devil/sets/72157594359355250/
_____________________________________________________________________________ 175 Digital Humanities 2008 _____________________________________________________________________________
Bibliography
Davis, H. and Walton, P. (1983): Language, Image, Media, Blackwell, Oxford.
Kress, G. and Van Leeuwen, T. (1996): Reading images: the grammar of visual design, Routledge, London.
Kress, G. and Van Leeuwen, T. (2001): Multimodal Discourse: the modes and media of contemporary communication, Arnold, London.
McCarty, Willard. 2008 (forthcoming). "Beyond the word: Modelling literary context". Special issue of Text Technology, ed. Lisa Charlong, Alan Burke and Brad Nickerson.
McGann, Jerome. 2004. "Marking Texts of Many Dimensions." In Susan Schreibman, Ray Siemens and John Unsworth, eds. A Companion to Digital Humanities. Oxford: Blackwell. www.digitalhumanities.org/companion/, 16. (12/10/07). 198-217.
Scollon, R. and Scollon, S. (2003): Discourses in Places: Language in the material world, Routledge, London.
Van Leeuwen, T. (2005): Introducing social semiotics, Routledge, London. The TTC-Atenea System: Researching Opportunities in the Field of Art-theoretical Terminology Nuria Rodríguez nro@uma.es University of Málaga, Spain The purpose of this presentation is to demonstrate the research opportunities that the TTC-ATENEA system offers to specialists in Art Theory. Principally, we will focus on those functionalities that are able to assist specialists in the interpretation of the theoretical discourses, a complex task due to the ambiguity and inaccuracy that distinguish this type of artistic terminology. Introduction The TTC-ATENEA system is currently being developed under my supervision in a Project supported by the Ministerio de Educación y Ciencia of Spain called Desarrollo de un tesauro terminológico conceptual (TTC) de los discursos teórico-artísticos españoles de la Edad Moderna, complementado con un corpus textual informatizado (ATENEA) (HUM05-00539). It is an interdisciplinary project integrated by the following institutions: University of Málaga (Spain), University of de Santiago de Compostela (Spain), University of Valencia (Spain), European University of Madrid (Spain), the Getty Research Institute (Los Angeles, EE.UU.), the University of Chicago (EE.UU.) and the Centro de Documentación de Bienes Patrimoniales of Chile. Also, it maintains a fruitful collaboration with the Department of Computational Languages and Sciences of the University of Málaga, specifically with the Project called ISWeb: Una Infraestructura Básica para el Desarrollo de La Web Semántica y su Aplicación a la Mediación Conceptual (TIN2005-09098-C05-01). This system can be defined as a virtual net made up of two complementary components: an electronic text database (ATENEA), and the terminological conceptual thesaurus (TTC). The TTC is a knowledge tool for art specialists that consists of a compilation of described, classified and linked terms and concepts.The textual database consists of important art-historical texts (16th -18th centuries), encoded in XMLTEI (1), that supply the terms and concepts recorded in the thesaurus (2). We want to remark the significant contribution of ATENEA database to the field of the virtual and digital libraries in Spain. Despite the fact that the building of digital and virtual libraries is an important area of development (3), ATENEA represents one of the first textual databases about specialized subject matter –with exception of the digital libraries on literary texts(4). It is a very different situation from other contexts, where we find relevant specialized electronic textual collections. It _____________________________________________________________________________ 176 Digital Humanities 2008 _____________________________________________________________________________ is the case, for example, of the digital libraries about Italian artistic texts developed and managed by Signum (Scuola Normale Superiore di Pisa) (5); or the collections included in the ARTFL Project (University of Chicago) (6). Providing Solutios to the Ambiguity of Art-theoretical Terminology The configuration of the ATENEA-TTC system was determined by the main purpose of the project itself--that is, to provide a satisfactory answer to the terminological/conceptual problems that encumber the task of conducting research in art theory, due to the high degree of semantic density of the terminology contained in historical documents. 
In this respect, it is important to point out that the objective of the Project is not only to distinguish the terminological polysemy, that is, the different concepts that a term denotes (i.e., disegno as product, disegno as abstract concept, disegno as formal component of the paintings…), but also the different interpretations that the same concept assumes in discourses belonging to different authors. For instances, the concept disegno, understood as an abstract concept, has not the same interpretation in Zuccaro´s treatise than in Vasari´s one, what implies that the term disegno does not mean exactly the same in such texts. Indeed, this is one of the more problematic aspects of textual interpretation, since the same term assumes different semantic nuances in each text or treatise, increasing, in that way, the possibilities for ambiguity and error. In connection with this question, the eminent Professor Fernando Marías (7) has suggested the development of a “cronogeografía” of the art-theoretical concepts and terms. In his words, the so-called “cronogeografía” should be a precise means of control of the emergence and use of the theoretical vocabulary as well as of the ideas behind this. We consider that the methodology proposed by this Project could be an answer to Professor Marías´ suggestion. It is clear that this approach requires, firstly, a comparative conceptual analysis among the artistic texts stored in ATENEA in order to identify the different interpretations of each concept and, as a result, the semantic variations assumed by every term. To make visible in the thesaurus these semanticconceptual distinctions, we have proceeded to tag concepts and terms by mean of parenthetical qualifiers related to authors, which have been specifically developed to this end. These qualifiers specify the author who has used or defined them. Thus, the TTC records the conceptual-semantic variations among concepts and terms used by different authors, and also shows them graphically and visually. For example, the following representation generated by the system (figure 1): Fig. 1. Uses and interpretations of vagueza in the Spanish art –theory (17th century). reveals us, at a glance, that vagueza has the same interpretation in Sigüenza and Santos´s discourses, but that this assumes other senses in Carducho and Martinez´s treatises. This procedure implies other significant feature of the ATENEA-TTC system: that terms and concepts are described in the TTC according to how they has been used or defined in each particular text.This last point deserves to be emphasized. Effectively, there are other projects that also link terms to online dictionary entries, in which the user finds general definitions of the terms. Nevertheless, in the TTC terms and concepts are defined in reference to the specific texts in where they have been located. So, clicking on the selected term or concept, the user gets information about how such term or concept has been specifically used and defined in a particular text or by a particular author. In addition to that, the TTC-ATENEA system provides full interaction between the ATENEA corpus and the TTC. The system easily allows the definition of connections between a term or a concept in an XML-TEI marked text and the registry of the TTC. Thus, terms and concepts in the text are linked to all their relevant information, stored and structured in the TTC records. 
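As a purely illustrative sketch of what an author-qualified thesaurus record might look like in code (the actual TTC schema is not described here, so every field name below is an assumption and the sense labels are placeholders), the idea of recording the same term separately for each author's use can be put as follows:

    # Minimal sketch, not the TTC-ATENEA data model: one record per (term, author) use.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class TTCRecord:
        term: str            # surface term, e.g. "vagueza"
        concept: str         # concept the term denotes in this use (placeholder labels below)
        author: str          # the parenthetical qualifier: whose text this description refers to
        description: str     # how the term is used/defined in that particular text

    THESAURUS = [
        TTCRecord("vagueza", "sense A", "Sigüenza", "description of the use in Sigüenza's text"),
        TTCRecord("vagueza", "sense A", "Santos", "same interpretation as in Sigüenza"),
        TTCRecord("vagueza", "sense B", "Carducho", "a different sense in Carducho's treatise"),
        TTCRecord("vagueza", "sense C", "Martínez", "yet another sense in Martínez's treatise"),
    ]

    def records_for(term, author=None):
        """Resolve a term (optionally within a given author's text) to its TTC descriptions."""
        return [r for r in THESAURUS
                if r.term == term and (author is None or r.author == author)]

    print(records_for("vagueza", author="Carducho"))

The sketch only makes concrete the principle stated above: a term occurrence in a marked-up text resolves to the description of that term as used by that particular author, not to a single general definition.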
In the same way, the system allows to establish a connection between terms and concepts registered in the TTC and the different places where they appear in the XMLTEI marked texts. As far as XML-TEI marked texts and TTC are stored in the same database, we can establish these links in a totally effective and efficient way without having to access to different data repositories to retrieve and link this information. Consequently, one of the most important potentialities of the system as a research tool derives from the interactions that users are able to establish among the different types of information compiled in each repository. Finally, future extensions of the TTC-ATENEA will study the possibility of use not only a thesaurus but a full fledged OWL [7] ontology and its integration in the Khaos Semantic _____________________________________________________________________________ 177 Digital Humanities 2008 _____________________________________________________________________________ Web platform [8] [9] [10] in order to achieve a formal explicit representation of the artistic epistemology contained in the texts. Research Possibilities The system allows users to retrieve the stored information from both terminological-conceptual and textual repositories. Nevertheless, as indicated above, the most important research possibilities derive from the interactions that users are able to establish between each repository. To illustrate this, we will give a brief example. 2.TTC as a starting point. Imagine now that the specialist is interested in the use of a particular term. In this case, the most convenient is to search the term in the TTC. Made the query, the specialist retrieves the different concepts associated to the term, as in a conventional specialized dictionary. However, as the term is marked with the author-parenthetical qualifiers, the specialist also finds out in which discourses the term has been identified denoting each concept (see figure1). In fact, various hyperlinks enable him to go from the TTC record to the exact point of the texts -stored in ATENEA- where the term appears denoting such concepts (figure 3). 1. ATENEA Corpus as a starting point. Imagine that a specialist is interested in studying the meanings of the terms used by the Italian author F. Pacheco in his Spanish treatise (Diálogos de la Pintura, 1633). ATENEA Corpus enables the specialist to visualize the electronic transcription of the text, and explore it linguistically by means of different queries concordances, frequency lists, co-occurrences…- since the system has been implemented with its own textual analyzer. These functionalities do not constitute any novelty; all we know that. Consequently, the most relevant feature resides in the fact that the system allows user to browse directly from a term or concept located in a text to the TTC record where such term or concept is described according to its use in Carducho´s text (figure 2). Fig. 3. Connection between a concept registered in the TTC and the places where they appear in the XML-TEI marked texts. Once in ATENEA Corpus, the specialist has the opportunity to use its functionalities in order to specify his investigation. The presentation will show in full detail these possibilities of usage derived from this interactivity-based approach, and will include time for Q&A. 
We believe that the questions taken into consideration in our exposition will be of interest to other humanistic disciplines, since the difficulties of textual interpretations due to the terminological ambiguity is a common place in the field of the Humanities. Notes Fig. 2. Connection between a concept in an XML-TEI marked text (ATENEA) and the TTC record, where it is described. In this way, the specialist is able to identify the precise senses that the terms used by Carducho assume, and, therefore, analyze with more accuracy his theory about the painting. Once the specialist has accessed to the TTC records, the system offers him other research possibilities given that the recorded information is linked to other sections of the TTC as well as to other texts stored in ATENEA. Thus, the specialist is able to go more deeply in his investigation and to get complementary information browsing through the site. The presentation will show some clarifying examples of these other possibilities. (1) http://www.tei-c.org. The DTD used in this Project has been created from the TEI-Lite DTD. We have selected those elements that best adjusted to our requirements and we have added others considered necessaries. (2) For further information on this system see [1], [2] and [3]. (3) The most paradigmatic example is, without doubt, the Biblioteca Virtual Miguel de Cervantes [http://www.cervantes.virtual.com]. For more information, see the directories of the Junta de Andalucía [http://www.juntadeandalucia.es/cultura/bibliotecavirtualandalucia/ recursos/otros_recursos.cmd], Ministerio de Cultura [http://www. mcu.es/roai/es/comunidades/registros.cmd] or the list posted on the web of the University of Alicante [http://www.ua.es/es/bibliotecas/ referencia/electronica/bibdigi.html]. Also, see [4] and [5]. _____________________________________________________________________________ 178 Digital Humanities 2008 _____________________________________________________________________________ (4) Other Spanish projects about specialized digital library are: the Biblioteca Virtual del Pensamiento Político Hispánico “Saavedra Fajardo”, whose purpose is to develop a textual corpus of documents and sources about the Spanish theories on politic [http://saavedrafajardo. um.es/WEB/HTML/web2.html]; and the Biblioteca Digital Dioscórides (Complutense University of Madrid) that provides digitalized texts from its biomedicine-historical collection [http://www.ucm.es/ BUCM/foa/presentacion.htm]. (5) http://www.signum.sns.it (6) http://humanities.uchicago.edu/orgs/ARTFL/ (7) See [6] References Re-Engineering the Tree of Knowledge:Vector Space Analysis and CentroidBased Clustering in the Encyclopédie Glenn Roe glenn@diderot.uchicago.edu University of Chicago, USA Robert Voyer [1] Rodríguez Ortega, N. (2006). Facultades y maneras en los discursos de Francisco Pacheco y Vicente Carducho .Tesauro terminológico-conceptual. Málaga: Universidad de Málaga. rlvoyer@diderot.uchicago.edu University of Chicago, USA [2] Rodríguez Ortega, N. (2006). “Use of computing tools for organizing and Accessing Theory and Art Criticism Information: the TTC-ATENEA Project”, Digital Humanities`06 Conference, 4-9 de julio (2006), Universidad de la Sorbonne, París. russ@diderot@uchicago.edu University of Chicago, USA [3] Rodríguez Ortega, N. (2005). “Terminological/Conceptual Thesaurus (TTC): Polivalency and Multidimensionality in a Tool for Organizing and Accessing Art-Historical Information”, Visual Resource Association Bulletin, vol. 32, n. 2, pp. 34-43. 
Mark Olsen [4] Bibliotecas Virtuales FHL (2005). Madrid, Fundación Ignacio Larramendi. rmorriss@uchicago.edu University of Chicago, USA [5] García Camarero, E. y García Melero, L. A. (2001). La biblioteca digital. Madrid, Arco/libros. The Encyclopédie of Denis Diderot and Jean le Rond d’Alembert is one of the crowning achievements of the French Enlightenment. Mobilizing many of the great – and the not-sogreat – philosophes of the eighteenth century, it was a massive reference work for the arts and sciences, which sought to organize and transmit the totality of human knowledge while at the same time serving as a vehicle for critical thinking.The highly complex structure of the work, with its series of classifications, cross-references and multiple points of entry makes it not only, as has been often remarked, a kind of forerunner of the internet[1], but also a work that constitutes an ideal test bed for exploring the impact of new machine learning and information retrieval technologies. This paper will outline our ongoing research into the ontology of the Encyclopédie , a data model based on the classes of knowledge assigned to articles by the editors and representing, when organized hierarchically, a system of human understanding. Building upon past work, we intend to move beyond the more traditional text and data mining approaches used to categorize individual articles and towards a treatment of the entire encyclopedic system as it is elaborated through the distribution and interconnectedness of the classes of knowledge. To achieve our goals, we plan on using a vector space model and centroid-based clustering to plot the relationships of the Encyclopédie‘s epistemological categories, generating a map that will hopefully serve as a [6] Marías, F. (2004). “El lenguaje artístico de Diego Velázquez y el problema de la `Memoria de las pinturas del Escorial´”, en Actas del Simposio Internacional Velázquez 1999, Junta de Andalucía, Sevilla, pp. 167-177. [7] OWL Web Ontology Language. http://www.w3.org/TR/ owl-features/ [8] Navas-Delgado I., Aldana-Montes J. F. (2004). “A Distributed Semantic Mediation Architecture”. Journal of Information and Organizational Sciences, vol. 28, n. 1-2. December, pp. 135-150. [9] Moreno, N., Navas, I., Aldana, J.F. (2003). “Putting the Semantic Web to Work with DB Technology”. IEEE Bulletin of the Technical Committee on Data Engineering. December, pp. 49-54. [10] Berners-Lee, T., Hendler, J., Lassila, O. (2001). “The semantic web”. Scientific American (2001). Russell Horton Charles Cooney cmcooney@diderot.uchicago.edu University of Chicago, USA mark@barkov.uchicago.edu University of Chicago, USA Robert Morrissey _____________________________________________________________________________ 179 Digital Humanities 2008 _____________________________________________________________________________ corollary to the taxonomic and iconographic representations of knowledge found in the 18th century.[2] Over the past year we have conducted a series of supervised machine learning experiments examining the classification scheme found in the Encyclopédie, the results of which were presented at Digital Humanities 2007. Our intent in these experiments was to classify the more than 22,000 unclassified articles using the known classes of knowledge as our training model. Ultimately the classifier performed as well as could be expected and we were left with 22,000 new classifications to evaluate. 
While there were certainly too many instances to verify by hand, we were nonetheless encouraged by the assigned classifications for a small sample of articles. Due to the limitations of this exercise, however, we decided to leverage the information given us by the editors in exploring the known classifications and their relationship to each other and then later, to consider the classification scheme as a whole by examining the general distribution of classes over the entire work as opposed to individual instances. Using the model assembled for our first experiment - trained on all of the known classifications - we then reclassified all of the classified articles. Our goal in the results analysis was twofold: first, we were curious as to the overall performance of our classification algorithm, i.e., how well it correctly labeled the known articles; and secondly, we wanted to use these new classifications to examine the outliers or misclassified articles in an attempt to understand better the presumed coherency and consistency of the editors’ original classification scheme.[3] In examining some of the reclassified articles, and in light of what we know about Enlightenment conceptions of human knowledge and understanding – ideas for which the Encyclopédie and its editors were in no small way responsible – it would seem that there are numerous cases where the machine’s classification is in fact more appropriate than that of the editors. The machine’s inability to reproduce the editors’ scheme with stunning accuracy came somewhat as a surprise and called into question our previous assumptions about the overall structure and ordering of their system of knowledge. Modeled after Sir Francis Bacon’s organization of the Sciences and human learning, the Système Figuré des connaissances humaines is a typographical diagram of the various relationships between all aspects of human understanding stemming from the three “root” faculties of Reason, Memory, and Imagination.[4] It provides us, in its most rudimentary form, with a view from above of the editors’ conception of the structure and interconnectedness of knowledge in the 18th century. However, given our discovery that the editors’ classification scheme is not quite as coherent as we originally believed, it is possible that the Système figuré and the expanded Arbre généalogique des sciences et arts, or tree of knowledge, as spatial abstractions, were not loyal to the complex network of contextual relationships as manifested in the text. Machine learning and vector space analysis offer us the possibility, for the first time, to explore this network of classifications as a whole, leveraging the textual content of the entire work rather than relying on external abstractions. The vector space model is a standard framework in which to consider text mining questions. Within this model, each article is represented as a vector in a very high-dimensional space where each dimension corresponds to the words in our vocabulary. The components of our vectors can range from simple word frequencies, to n-gram and lemma counts, in addition to parts of speech and tf-idf (term frequencyinverse document frequency), which is a standard weight used in information retrieval. The goal of tf-idf is to filter out both extremely common and extremely rare words by offsetting term frequency by document frequency. Using tf-idf weights, we will store every article vector in a matrix corresponding to its class of knowledge. 
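A minimal sketch of this weighting, and of the centroid step described in the next paragraph, might look as follows (toy data and an assumed scikit-learn/NumPy stack, not the project's code):

```python
# Sketch only: tf-idf vectors grouped by class of knowledge, one
# centroid (mean vector) per class, and a cosine comparison of two
# classes. The articles and labels are toy stand-ins.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

articles = [
    "article on verbs and syntax",
    "article on triangles and proofs",
    "article on nouns and syntax",
    "article on circles and proofs",
]
classes = ["Grammaire", "Géométrie", "Grammaire", "Géométrie"]

X = TfidfVectorizer().fit_transform(articles).toarray()

centroids = {
    label: X[[i for i, c in enumerate(classes) if c == label]].mean(axis=0)
    for label in set(classes)
}

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(centroids["Grammaire"], centroids["Géométrie"]))
```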
We will then distill these class matrices into individual class vectors corresponding to the centroid of the matrix.[5] Centroid or mean vectors have been employed in classification experiments with great success.[6] While this approach is inherently lossy, our initial research suggests that by filtering out function words and lemmatizing, we can reduce our class matrices in this way and still retain a distinct class core. Using standard vector space similarity measures and an open-source clustering engine we will cluster the class vectors and produce a new tree of knowledge based on semantic similarity. We expect the tree to be best illustrated as a weighted undirected graph, with fully-connected sub-graphs. We will generate graphs using both the original classifications and the machine’s decisions as our labels. Due to the size and scale of the Encyclopédie, its editors adopted three distinct modes of organization - dictionary/alphabetic, hierarchical classification, and cross-references - which, when taken together, were said to represent encyclopedic knowledge in all its complexity.[7] Using the machine learning techniques outlined above, namely vector space analysis and centroidbased clustering, we intend to generate a fourth system of organization based on semantic similarity. It is our hope that a digital representation of the ordering and interconnectedness of the Encyclopédie will highlight the network of textual relationships as they unfold within the work itself, offering a more inclusive view of its semantic structure than previous abstractions could provide. This new “tree of knowledge” can act as a complement to its predecessors, providing new points of entry into the Encyclopédie while at the same time suggesting previously unnoticed relationships between its categories. _____________________________________________________________________________ 180 Digital Humanities 2008 _____________________________________________________________________________ References [1] E. Brian, “L’ancêtre de l’hypertexte”, in Les Cahiers de Science et Vie 47, October 1998, pp. 28-38. [2] R. Darnton. “Philosophers Trim the Tree of Knowledge.” The Great Cat Massacre and Other Essays in French Cultural History. London: Penguin, 1985, pp. 185-207. [3] R. Horton, R. Morrissey, M. Olsen, G. Roe, and R.Voyer. “Mining Eighteenth Century Ontologies: Machine Learning and Knowledge Classification in the Encyclopédie.” under consideration for Digital Humanities Quarterly. [4] For images of the Système figuré des connaissance humaines and the Arbre généalogique des sciences et des arts principaux see, http://www.lib.uchicago.edu/efts/ARTFL/projects/encyc/ systeme2.jpg; and http://artfl.uchicago.edu/cactus/. [5] G. Salton, A. Wong, and C. S.Yang. “A vector space model for automatic indexing.” Communications of the ACM. November 1975. [6] E. Han and G. Karypis. “Centroid-Based Document Classification.” Principles of Data Mining and Knowledge Discovery. New York: Springer, 2000, pp. 424-431. [7] G. Blanchard and M. Olsen. “Le système de renvois dans l’Encyclopédie: une cartographie de la structure des connaissances au XVIIIème siècle.” Recherches sur Diderot et sur l’Encyclopédie, April 2002. Assumptions, Statistical Tests, and Non-traditional Authorship Attribution Studies -- Part II Joseph Rudman jr20@andrew.cmu.edu Carnegie Mellon University Statistical inferences are based only in part upon the observations. An equally important base is formed by prior assumptions about the underlying situation. 
Even in the simplest cases, there are explicit or implicit assumptions about randomness and independence.... -- Huber

Introduction

Controversy surrounds the methodology used in non-traditional authorship attribution studies -- those studies that make use of the computer, stylistics, and statistics. [Rudman 2003] [Rudman 1998] One major problem is that many of the most commonly used statistical tests have assumptions about the input data that do not hold for the primary data of these attribution studies -- the textual data itself (e.g. normal distributions, randomness, independence). “...inappropriate statistical methods.... In particular, asymptotic normality assumptions have been used unjustifiably, leading to flawed results.” [Dunning, p. 61] “Assumptions such as the binomiality of word counts or the independence of several variables chosen as markers need checking.” [Mosteller and Wallace] This paper looks at some of the more frequently used tests (e.g. chi-square, Burrows’ Delta) and then at the questions, “Are there assumptions behind various tests that are not mentioned in the studies?” and “Does the use of statistical tests whose assumptions are not met invalidate the results?”

Part I of this paper was presented at the “10th Jubilee Conference of the International Federation of Classification Societies,” 28 July 2006 in Ljubljana, Slovenia. The idea behind Part I was to get as much input as possible from a cross-section of statisticians from around the world. Part II takes the statisticians’ input and continues looking at the problem of assumptions made by various statistical tests used in non-traditional attribution studies.

Because of the complicated intertwining of disciplines in non-traditional authorship attribution, each phase of the experimental design should be as accurate and as scientifically rigorous as possible. The systematic errors of each step must be computed and summed so that the attribution study can report an overall systematic error -- Mosteller and Wallace come close to doing this in their “Federalist” study -- and it seems that no one else tries.

It is the systematic error that drives the focus of this paper -- the assumptions behind the statistical tests used in attribution studies. I am not concerned with assumptions made by practitioners that are not an integral part of the statistics -- e.g. Morton, using the cusum technique, assumes that style markers are constant across genre; this has been shown to be false but has nothing to do with the cusum test itself. [Sanford et al.] There are many statistical tests, along with their many assumptions, that are used in non-traditional attribution studies -- e.g. the Efron-Thisted tests are based on the assumption that things (words) are well mixed in time [Valenza] -- obviously not true in attribution studies. Because of time constraints, I want to limit the number of tests discussed to three: the chi-square, Burrows’ Delta, and a third that is more of a ‘category’ -- machine learning. This paper looks at each test and attempts to explain why the assumptions exist -- how they are determined -- and how integral the assumptions are to the use of the test.

Chi-Square Test

The chi-square test, in all of its manifestations, is ubiquitous in non-traditional authorship studies. It is also the test that has received the most criticism from other practitioners.
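For readers unfamiliar with the test, a minimal sketch of how a chi-square test is typically applied to word-frequency data follows (the counts are invented); the independence and expected-frequency assumptions discussed below attach to exactly this kind of contingency table:

```python
# Sketch: a chi-square test on a 2 x 4 contingency table of counts for
# four frequent words ("the", "of", "and", "to") in two text samples.
# The numbers are invented; the point is the shape of the test to which
# the assumptions discussed in this paper apply.
from scipy.stats import chi2_contingency

counts_author_a = [120, 80, 60, 40]
counts_disputed = [100, 95, 50, 55]

chi2, p, dof, expected = chi2_contingency([counts_author_a, counts_disputed])
print(round(chi2, 2), round(p, 4), dof)
```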
Delcourt lists some of the problems [Delcourt (from Lewis and Burke)]:

1) Lack of independence among the single events or measures
2) Small theoretical frequencies
3) Neglect of frequencies of non-occurrence
4) Failure to equalize the sum of the observed frequencies and the sum of the theoretical frequencies
5) Indeterminate theoretical frequencies
6) Incorrect or questionable categorizing
7) Use of non-frequency data
8) Incorrect determination of the number of degrees of freedom

Burrows’ Delta

This section discusses the assumptions behind Burrows’ Delta, as articulated by Shlomo Argamon:

1) Each indicator word is assumed to be randomly distributed
2) Each indicator word is assumed to be statistically independent of every other indicator word’s frequency

The ramifications of ignoring these assumptions are discussed. Do the assumptions really make a difference in looking at the results? Burrows’ overall methodology and answers are to be highly commended.

Machine Learning -- Data Mining

Almost all machine learning statistical techniques assume independence in the data. David Hand et al. say, “...of course...the independence assumption is just that, an assumption, and typically it is far too strong an assumption for most real world data mining problems.” [Hand et al., 289] Malerba et al. state, “Problems caused by the independence assumption are particularly evident in at least three situations: learning multiple attributes in attribute-based domains, learning multiple predicates in inductive logic programming, and learning classification rules for labeling.” The ramifications of ignoring these assumptions are discussed.

Conclusion

The question at hand is not, “Does the violation of the assumptions matter?”, but rather, “How much does the violation of assumptions matter?” and, “How can we either correct for this or calculate a systematic error?” Does the chi-square test always demand independence and randomness, and ever a normal distribution? Gravetter and Wallnau say that although the chi-square test is non-parametric, “...they make few (if any) assumptions about the population distribution.” [Gravetter and Wallnau, 583] How are we to view attribution studies that violate assumptions, yet show some success with studies using only known authors? The works of McNemar, Baayen, and Mosteller and Wallace are discussed. The ramifications of ignoring these assumptions are discussed.

The following are answers often given to questions about assumptions:

1) The statistics are robust. The tests are so robust that the assumptions just do not matter.
2) They work until they don’t work. You will know when this is and then you can re-think what to do.
3) Any problems are washed out by statistics. There is so much data that any problems from the violation of assumptions are negligible.

These answers and some solutions are discussed.

Preliminary References

Argamon, Shlomo. “Interpreting Burrows’ Delta: Geometric and Probabilistic Foundations.” (To be published in Literary and Linguistic Computing.) Pre-print courtesy of the author, 2006.

Baayen, R. Harald. “The Randomness Assumption in Word Frequency Statistics.” Research in Humanities Computing 5. Ed. Giorgio Perissinotto. Oxford: Clarendon Press, 1996, 17-31.

Banks, David.
“Questioning Authority.” Classification Society of North America Newsletter 44 (1996): 1.

Box, George E.P. et al. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. New York: John Wiley and Sons, 1978.

Burrows, John F. “‘Delta’: A Measure of Stylistic Difference and a Guide to Likely Authorship.” Literary and Linguistic Computing 17.3 (2002): 267-287.

Cox, C. Phillip. A Handbook of Introductory Statistical Methods. New York: John Wiley and Sons, 1987.

Delcourt, Christian. “Stylometry.” Revue Belge de Philologie et d’Histoire 80.3 (2002): 979-1002.

Dunning, T. “Accurate Methods for the Statistics of Surprise and Coincidence.” Computational Linguistics 19.1 (1993): 61-74.

Dytham, Calvin. Choosing and Using Statistics: A Biologist’s Guide. (Second Edition) Malden, MA: Blackwell, 2003.

Gravetter, Frederick J., and Larry B. Wallnau. Statistics for the Behavioral Sciences. (5th Edition) Belmont, CA: Wadsworth Thomson Learning, 2000.

Hand, David, et al. Principles of Data Mining. Cambridge, MA: The MIT Press, 2001.

Holmes, David I. “Stylometry.” In Encyclopedia of Statistical Sciences (Update of Volume 3). Eds. Samuel Kotz, Campbell Read, and David Banks. New York: John Wiley and Sons, 1999.

Hoover, David L. “Statistical Stylistics and Authorship Attribution: An Empirical Investigation.” Literary and Linguistic Computing 16.4 (2001): 421-444.

Huber, Peter J. Robust Statistics. New York: John Wiley and Sons, 1981.

Khmelev, Dmitri V., and Fiona J. Tweedie. “Using Markov Chains for Identification of Writers.” Literary and Linguistic Computing 16.3 (2001): 299-307.

Lewis, D., and C.J. Burke. “The Use and Misuse of the Chi-square Test.” Psychological Bulletin 46.6 (1949): 433-489.

Malerba, D., et al. “A Multistrategy Approach to Learning Multiple Dependent Concepts.” In Machine Learning and Statistics: The Interface. Eds. G. Nakhaeizadeh and C.C. Taylor. New York: John Wiley and Sons, 87-106.

McNemar, Quinn. Psychological Statistics (3rd Edition). New York: John Wiley and Sons, 1962.

Mosteller, Frederick, and David L. Wallace. Applied Bayesian and Classical Inference: The Case of the “Federalist Papers.” New York: Springer-Verlag, 1984.

Rudman, Joseph. “Cherry Picking in Non-Traditional Authorship Attribution Studies.” Chance 16.2 (2003): 26-32.

Rudman, Joseph. “The State of Authorship Attribution Studies: Some Problems and Solutions.” Computers and the Humanities 31 (1998): 351-365.

Sanford, Anthony J., et al. “A Critical Examination of Assumptions Underlying the Cusum Technique of Forensic Linguistics.” Forensic Linguistics 1.2 (1994): 151-167.

Valenza, Robert J. “Are the Thisted-Efron Authorship Tests Valid?” Computers and the Humanities 25.1 (1991): 27-46.

Wachal, Robert Stanly. “Linguistic Evidence, Statistical Inference, and Disputed Authorship.” Ph.D. Dissertation, University of Wisconsin, 1966.

Williams, C.B. “A Note on the Statistical Analysis of Sentence Length as a Criterion of Literary Style.” Biometrika 31 (1939-1940): 356-361.

Does Size Matter?
A Reexamination of a Timeproven Method Jan Rybicki jrybicki@ap.krakow.pl Pedagogical University, Krakow, Poland Previous stylometric studies (Rybicki 2006, 2006a, 2007) of patterns in multivariate diagrams of correlation matrices derived from relative frequencies of most frequent words in character idiolects (Burrows 1987) in a number of originals and translations (respectively, Sienkiewicz’s trilogy of historical romances and its two English translations; Rybicki’s Polish translations of novels by John le Carré; and the three English versions of Hamlet, Q1, Q2, F, and its translations into nine different languages) have all yielded interesting similarities in the layout of data points in the above-mentioned diagrams for corresponding originals and translations. The repetitiveness of the observed phenomenon in many such pairs immediately raised some questions as to its causes – the more so as the most-frequent-word lists for any original and its translations, consisting primarily of functions words, modals, pronouns, prepositions and articles (if any in a given language) contains few direct counterparts in any two languages. The primary suspect for a possible bias was the difference in size between the parts of individual characters, since that particular feature invariably remains proportionally similar between original and translation (Rybicki 2000). The import of this element remained unnoticed in many other studies employing Burrows’s time-proven method (later developed, evaluated and applied by a number of scholars, including Hoover 2002) since they were either conducted on equal samples or, more importantly, although some did address the problem of translation, they never dealt with so many translations at a time as did e.g. the Hamlet study (Rybicki 2007). The emerging suspicion was that since the sizes of the characters studied do not vary significantly between the twelve versions of Hamlet, this proportion might heavily influence the multivariate graphs produced – the more so as, in most cases, the most talkative characters occupied central positions in the graphs, while the lesser parts were usually limited to the peripheries. This (together with similar effects observed in earlier studies) had to lead to a reassessment of the results. Also, since most studies were conducted in traditional maledominated writing, resulting in female characters speaking little in proportion even to their importance in the plot, these peripheries usually included separate groups of women; while this was often seen as the authors’ ability to stylistically differentiate “male” and “female” idiom, the size bias could distort even this quite possible effect. A number of tests was then performed on character idiolects and narrative fragments of various sizes and various configurations of characters from selected English novels and plays – ranging from Ruth and David Copperfield; from Conan Doyle to Poe; narration in Jane Austen, or plays by Ben Jonson and Samuel Beckett – to investigate how the impact of size distorts the image of stylistic differences presented in multivariate diagrams. One possible source of the bias is the way relative frequencies of individual words are calculated. Being a simple ratio of word frequency to size of sample, it may be as unreliable as another very similar formula, the type-to-token ratio (or, indeed, one as complex as Yule’s K), has been found to be unreliable as a measure of vocabulary richness in samples of different size (Tweedie & Baayen 1997). 
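A small simulation along these lines (a sketch under assumed Zipf-like word frequencies, not the author's actual experiment) shows how the type-to-token ratio drifts downward as sample size grows even though the underlying "style" never changes:

```python
# Sketch: type-token ratio of samples of increasing size drawn from the
# same Zipf-like vocabulary; the ratio falls as samples grow, which is
# why unequal character parts can bias size-sensitive statistics.
import random

random.seed(1)
vocabulary = [f"w{i}" for i in range(5000)]
weights = [1.0 / (rank + 1) for rank in range(5000)]

for size in (500, 2000, 10000):
    sample = random.choices(vocabulary, weights=weights, k=size)
    ttr = len(set(sample)) / len(sample)
    print(size, round(ttr, 3))
```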
Also, due to the nature of multivariate analyses used (both Factor Analysis and Multimensional Scaling), it is not surprising that the less stable statistics of the less talkative characters would have their data points pushed to the peripheries of the graph. Finally, since the list of the most frequent words used in the analysis is most heavily influenced by the longest parts in any text, this might also be the reason for the “centralising” bias visible in data points for such characters. It should be stressed that the above reservations do not, in any way, invalidate the entire method; indeed, results for samples of comparative size remain reliable and unbiased. Also, it should not be forgotten that size of a character’s part is in itself an individuating feature of that character. References Burrows, J.F. (1987). Computation into Criticism: A Study of Jane Austen’s Novels and an Experiment in Method. Oxford: Clarendon Press. Hoover, D.L. (2002). New Directions in Statistical Stylistics and Authorship Attribution. Proc. ALLC/ACH. Rybicki, J. (2000). A Computer-Assisted Comparative Analysis of Henryk Sienkiewicz’s Trilogy and its Two English Translations. PhD thesis, Kraków: Akademia Pedagogiczna. Rybicki, J. (2006). Burrowing into Translation: Character Idiolects in Henryk Sienkiewicz’s Trilogy and its Two English Translations. LLC 21.1: 91-103. Rybicki, J. (2006a). Can I Write like John le Carré. Paris: Digital Humanities 2006. Rybicki, J. (2007). Twelve Hamlets: A Stylometric Analysis of Major Characters’ Idiolects in Three English Versions and Nine Translations. Urbana-Champaign: Digital Humanities 2007. F. Tweedie, R. Baayen (1997). Lexical ‘constants’ in stylometry and authorship studies, <http://www.cs.queensu.ca/achallc97/ papers/s004.html> _____________________________________________________________________________ 184 Digital Humanities 2008 _____________________________________________________________________________ The TEI as Luminol: Forensic Philology in a Digital Age Stephanie Schlitz sschlitz@bloomu.edu Bloomsburg University, USA A significant number of digital editions have been published in recent years, and many of these serve as exemplars for those of us working within the digital editing community. A glance at Electronic Text Editing, for example, published in 2006, indicates that such projects address a wide berth of editorial problems, ranging from transcriptional practices to document management to authenticity. And they offer a wealth of information to editors working on various types of digital editions. In this paper, I consider an editorial problem that has yet to be resolved. My discussion centers on the difficulties that arise in editing a single, albeit somewhat unusual, Icelandic saga: Hafgeirs saga Flateyings. This saga is preserved in an unsigned, eighteenth-century manuscript, Additamenta 6, folio (Add. 6, fol.). Today housed in the collection of the Arni Magnusson Institute for Icelandic Studies in Reykjavík, Iceland, the manuscript was originally held as part of the Arnamagnæan Collection in Copenhagen, Denmark. According to the flyleaf: “Saga af Hafgeyre flateying udskreven af en Membran der kommen er fra Island 1774 in 4to exarata Seculo xij” (Hafgeirs saga Flayetings was copied from a twelfth-century manuscript written in quarto, which came [to Copenhagen] from Iceland in 1774). 
While such a manuscript might appear unremarkable, as a number of paper manuscripts were copied during the late eighteenth century in Copenhagen, then the capital of Iceland and the seat of Icelandic manuscript transmission during this period, only twelve Old Norse/Old Icelandic manuscripts of those catalogued in the Copenhagen collections are dated to the twelfth century, while a mere eighteen are dated to 1200 (Kalund 512).The dating on the flyleaf is therefore unusual, and as it turns out, suspect as well, since no catalog entry exists to record the existence of the alleged source manuscript. Moreover, according to Jorgensen, the motif sequences found in Hafgeirs saga bear a striking resemblance to those found in the well-known mythical-heroic saga Hálfdanars saga Brönufóstra (157). And in a fascinating argument based primarily on this fact, Jorgensen argues that Add. 6, fol. is a forgery, claiming that Þorlákur Magnússon Ísfiord, an Icelandic student studying and working in Copenhagen during the 1780s, composed and sold Hafgeirs saga as a copy of an authentic medieval Icelandic saga (163). In spite of its questionable origin, Hafgeirs saga stands as a remnant of Iceland’s literary, linguistic, and textual history, and Add. 6, fol. can therefore be viewed as an important cultural artefact. As the editor of the Hafgeirs saga manuscript, my aim is to provide a ‘reliable’ (see “Guidelines for Editors of Scholarly Editions” Section 1.1) electronic edition of the text and the manuscript. But the question, at least until recently, was how? What is the best way to represent such a text? Encoding the work according to a markup standard such as the TEI Guidelines is surely a starting point, but doing so doesn’t solve one of the primary concerns: How to represent the manuscript reliably (which presents a complex editorial problem of its own), while at the same time illuminating the textual and linguistic ‘artefacts’ that may offer readers insight into the saga’s origin? At the 2007 Digital Humanities Summer Institute, Matthew Driscoll gave a talk entitled “Everything But the Smell: Toward a More Artefactual Digital Philology.” The talk provided a brief history of the shift toward ‘new’ philology, and, importantly, underscored the significance of the material or ‘artefactual’ aspect of new philology, which views manuscripts as physical objects and thus as cultural artefacts which offer insight into the “process to which they are witness” (“Everything But the Smell: Toward a More Artefactual Digital Philology”). The TEI, Driscoll pointed out, offers support for artefactual philology, and the descriptive framework of the P5 Guidelines, which defines Extensible Markup Language (XML) as the underlying encoding language, is ideal for this kind of work. Moreover, as Driscoll suggests, there is “no limit to the information one can add to a text - apart, that is, from the limits of our own imagination” (“Electronic Textual Editing: Levels of Transcription”). To be sure, this paper does not lapse into what McCarty refers to as the ‘complete encoding fallacy’ or the ‘mimetic fallacy’ (see Dahlström 24), but it does agree with Driscoll in arguing that P5-conformant editions, which can offer a significant layering of data and metadata, have the potential to move the reader beyond the aesthetics of sensory experience. 
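To make the idea of layered encoding and reader-selectable views concrete, the following is a hedged sketch (the markup, words and element choices are invented for illustration and are not the Hafgeirs saga edition's schema) of how a single encoded reading can yield either a semi-diplomatic or a normalized text:

```python
# Sketch: one encoded reading with <orig>/<reg> alternatives inside
# <choice>; keeping or suppressing one alternative yields different
# views of the same source. The markup here is invented for illustration.
from lxml import etree

TEI = "http://www.tei-c.org/ns/1.0"
src = """<s xmlns="http://www.tei-c.org/ns/1.0">
  <choice><orig>konungr</orig><reg>konungur</reg></choice> var rikr
</s>"""

def view(xml, keep="orig"):
    root = etree.fromstring(xml.encode("utf-8"))
    drop = "reg" if keep == "orig" else "orig"
    for el in list(root.iter("{%s}%s" % (TEI, drop))):
        el.getparent().remove(el)
    return " ".join(root.itertext()).split()

print(view(src, keep="orig"))  # semi-diplomatic-style reading
print(view(src, keep="reg"))   # normalized reading
```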
By definition, artefactual philology portends a kind of ‘evidentiary’ approach, one that can frame textual features, including linguistic and non-linguistic features (such as lacunae) for example, as kinds of evidence. Evidence of what? That is in the hands of the editors and the readers, but conceivably: linguistic development, the transmission process, literary merit, and so on. And when an evidentiary approach to philology is defined within a ‘generative’ approach to a scholarly edition (see Vanhoutte’s “Generating” 164), a new direction in electronic editing becomes possible. This paper explores this new direction. It shows how the Hafgeirs saga edition employs such a framework to address the problem of describing linguistic and non-linguistic artefacts, which are precisely the kinds of evidence that can bear witness to the composition date and transmission process. And it demonstrates how the display decisions result in an interactive and dynamic experience for the edition’s readers. For example, in addition to elements defined within existing modules of the TEI, the edition’s schema defines four new elements1. Drawn from the perspectives of historical and socio- linguistics, these elements are intended to aid readers in evaluating the saga’s composition date. Given the ‘logic _____________________________________________________________________________ 185 Digital Humanities 2008 _____________________________________________________________________________ of abundance’ (see Flanders 135), encoding the metadata described in the new elements beside the descriptive data described by pre-defined elements (such as <del> for example) can be accomplished without sacrificing the role of the text as a literary and cultural artefact. Because the transformation of the source XML has been designed to display interactively the various encodings of the text2, readers can view or suppress the various descriptions and generate their own novel versions of the text. Readers can display archaisms, for instance, and assess whether they are “affectation[s] of spurious age” (Einar 39) or features consistent with the textual transmission process, and they can view borrowings, for instance, and assess whether they preclude a medieval origin or are to be expected in a text ostensibly copied by a scribe living in Copenhagen during the eighteenth century. Or they can suppress these features and view the normalized transcription, the semi-diplomatic transcription, emendations, editorial comments, or any combination of these. Ultimately, this paper synthesizes aspects of text editing, philology, and linguistics to explore a new direction in digital editing. In doing so, it frames P5 XML as a kind of luminol that, when transformed, can be used to illuminate new types of evidence: linguistic and philological data.And the goal is identical to that of a crime-scene investigator’s: Not necessarily to solve the case, but to preserve and to present the evidence. Notes Driscoll, M.J. Text Encoding Initiative Consortium. 15 August 2007. “Electronic Textual Editing: Levels of Transcription.” <http://www.tei-c.org/Activities/ETE/Preview/driscoll.xml>. Einar Ól. Sveinsson. Dating the Icelandic Sagas. London:Viking Society for Northern Research, 1958. Flanders, Julia. “Gender and the Eectronic Text.” Electronic Text. Ed. Kathryn Sutherland. Clarendon Press: Oxford, 1997. 127-143. “Guidelines for Editors of Scholarly Editions.” Modern Language Association. 25 Sept. 2007. 10 Nov. 2007. <http:// www.mla.org/cse_guidelines>. 
Jorgensen, Peter. “Hafgeirs saga Flateyings: An EighteenthCentury Forgery.” Journal of English and Germanic Philology. LXXVI (1977): 155-164. Kalund, Kristian. Katalog over de oldnorsk-islandske handskrifter i det store kongelige bibliotek og i universitetsbiblioteket. Kobenhavn: 1900. Vanhoutte, Edward. “Traditional Editorial Standards and the Digital Edition.” Learned Love: Proceedings from the Emblem Project Utrecht Conference on Dutch Love Emblems and the Internet (November 2006). Eds. Els Stronks and Peter Boot. DANS Symposium Publications 1: The Hague, 2007. 157-174. [1] The <borrowing> element describes a non-native word which has been adopted into the language. Distinct from the <foreign> element, a borrowing may have the following attributes: @sourcelang (source language), @borrowdate (date of borrowing), @borrowtype (type of borrowing; e.g. calque, loanword). The <modernism> element describes a word, phrase, usage, or peculiarity of style which represents an innovative or distinctively modern feature. The <neologism> element describes a word or phrase which is new to the language or one which has been recently coined. The <archaism> element describes an archaic morphological, phonological, or syntactic feature or an archaic word, phrase, expression, etc. [2]My discussion of the display will continue my in-progress work with Garrick Bodine, most recently presented at TEI@20 on November 1, 2007: “From XML to XSL, jQuery, and the Display of TEI Documents.” References Burnard, Lou, and Katherine O’Brien O’Keefe and John Unsworth eds. Electronic Text Editing. New York: The Modern Language Association of America, 2006. Dalhström, Mats. “How Reproductive is a Scholarly Edition?” Literary and Linguistic Computing. 19:1 (2004): 17-33. Driscoll, M.J. “Everything But the Smell: Toward a More Artefactual Digital Philology.” Digital Humanities Summer Institute. 2007. _____________________________________________________________________________ 186 Digital Humanities 2008 _____________________________________________________________________________ A Multi-version Wiki Desmond Schmidt schmidt@itee.uq.edu.au University of Queensland, Australia Nicoletta Brocca nbrocca@unive.it Università Ca’ Foscari, Italy Domenico Fiormonte fiormont@uniroma3.it Università Roma Tre, Italy If, until the middle of the nineties, the main preoccupation of the scholarly editor had been that of conserving as faithfully as possible the information contained in the original sources, in the last ten years attention has shifted to the user’s participation in the production of web content. This development has occurred under pressure from the ‘native’ forms of digital communication as characterised by the term ‘Web 2.0’ (Tapscott and Williams, 2006). We are interested in harnessing the collaborative power of these new environments to create ‘fluid, co-operative, distributed’ (Robinson 2007: 10) digital versions of our textual cultural heritage. The approach taken by Hypernietzsche (Barbera, 2007) similarly tries to foster collaboration around cultural heritage texts. However, while Hypernietzsche focuses on annotation, what we have developed is a wiki-inspired environment for documents that were previously considered too complex for this kind of editing. Before explaining how this can work in practice, it is worthwhile to reflect on why it is necessary to change our means of editing cultural heritage texts. 
G.T.Tanselle has recently argued (2006) that in the digital medium ‘we still have to confront the same issues that editors have struggled with for twenty-five hundred years’. That may be true for analysis of the source texts themselves, but digitisation does change fundamentally both the objectives of editing and the function of the edition. For the representation of a text ‘is conditioned by the modes of its production and reproduction’ (Segre, 1981). The wellknown editorial method of Lachmann, developed for classical texts, and that of the New Bibliography for early printed books, both assumed that the text was inherently corrupt and needed correction into the single version of a print edition. With the advent of the digital medium textual critics ‘discovered that texts were more than simply correct or erroneous’ (Shillingsburg, 2006: 81).The possibility of representing multiple versions or multiple markup perspectives has long been seen as an enticing prospect of the digital medium, but attempts to achieve this so far have led either to complexity that taxes the limitations of markup (Renear, 1997: 121) or to an overload of information and a ‘drowning by versions’ (Dalhstrom, 2000). It is now generally recognised that written texts can contain complexities and subtleties of structure that defeat the power of markup alone to represent them (Buzzetti, 2002). In our talk we would like to present three examples of how overlapping structures can be efficiently edited in a wiki: of a modern genetic text in Italian, of a short classical text, the ‘Sybilline Gospel’, and of a short literary text marked up in various ways. Our purpose is to demonstrate the flexibility of the tool and the generality of the underlying algorithm, which is not designed for any one type of text or any one type of editing. However, because of the limited space, we will only describe here the Sibylline Gospel text. This is a particularly interesting example, because it not only has a complex manuscript tradition but it has also been deliberately altered throughout its history like agenetic text. The Sibylla Tiburtina is an apocalyptic prophecy that describes the nine ages of world history up to the Last Judgement. The first version of this prophecy, which enjoyed a widespread and lasting popularity throughout the whole medieval era, was written in Greek in the second half of the 4th century. Of this lost text we have an edited Byzantine version, dating from the beginning of the 6th century, and several Latin versions, besides translations in Old French and German, in oriental languages (Syric, Arabic and Ethiopian) and in Slavic and Romanian. The known Latin manuscripts number approximately 100, ranging in date from the mid 11th century to the beginning of the 16th. In the course of these centuries, with a particular concentration between the 11th and 12th centuries, the text was subjected to continuous revisions, in order to adapt it. This is demonstrated both by the changing names of eastern rulers mentioned in the ninth age, which involves the coming of a Last Roman Emperor and of the Antichrist, and by the introduction of more strictly theological aspects, especially in the so-called ‘Sibylline Gospel’, that is an account of Christ’s life presented by the Sibyl under the fourth age. No critical edition of the Sibylla Tiburtina based on all its Latin versions has yet been produced, although this is the only way to unravel the genesis of the text and the history of its successive reworkings. 
A classical type of critical edition, however, would not be appropriate, nor would it make sense, to establish a critical text in the traditional sense as one ‘cleaned of the errors accumulated in the course of its history and presented in a form most closely imagined to have been the original’. With this kind of representation one would have on the one hand the inconvenience of an unwieldy apparatus criticus and on the other the serious inconsistency of a ‘critical’ text that never had any real existence. To handle cases like this, we are well advanced in the development of a multi-version document wiki application.

The multi-version document (or MVD) concept is a further development of the variant graph model described at Digital Humanities 2006 and elsewhere (Schmidt and Fiormonte, 2007). In a nutshell, the MVD format stores the text as a graph that accurately represents a single work in digital form, however many versions or markup perspectives it may be composed of. An MVD file is no mere aggregation of separate files; it is a single digital entity, within which versions may be efficiently searched and compared. It consists of three parts:

1) The variant-graph, consisting of a list of the differences between the versions.
2) A description of each of the versions, including a short name or siglum, e.g. ‘L10’, a longer name, e.g. ‘London, Brit. Lib. Cotton Titus D.III, saec. XIII’, and a group name.
3) A list of groups. A group is a name for a group of versions or other groups. For example, in the Sybilline Gospel text there are three recensions, to which each of the manuscripts belongs.

The wiki application consists of two JAVA packages (outlined in bold): At the lower level the NMerge package implements all the functionality of the MVD format: the searching, comparison, retrieval and saving of versions. It can also export an MVD into a readable XML form. The text of each version is recorded in a simplified form of TEI-XML, but the MVD format does not rely on any form of markup, and is equally capable of handling binary file formats. Building on this package, the Phaidros web application provides various forms for viewing and searching a multi-version document: for comparing two versions, for viewing the text alongside its manuscript facsimile, for reading a single version or for examining the list of versions. The user may also edit and save the XML content of each of these views. Since NMerge handles all of the overlapping structures, the markup required for each version can be very simple, as in a real wiki. In the drawing below the differences between the two versions are indicated by underlined or bold text.
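The drawing is not reproduced here; as a loose textual stand-in (a simplified sketch, not the NMerge data model or its API), a multi-version document can be pictured as a sequence of text segments, each tagged with the set of versions that share it:

```python
# Sketch: a toy multi-version document as a list of (version-set, text)
# segments; one structure yields any single version or a comparison.
# Sigla and fragments are invented; this is not the NMerge API.
mvd = [
    ({"L10", "P", "V"}, "Sibylla "),
    ({"L10"},           "dixit "),
    ({"P", "V"},        "respondit "),
    ({"L10", "P", "V"}, "regi ..."),
]

def text_of(version):
    return "".join(text for versions, text in mvd if version in versions)

def compare(a, b):
    """Label each segment as shared by both versions or particular to one."""
    out = []
    for versions, text in mvd:
        if a in versions and b in versions:
            out.append(("both", text))
        elif a in versions:
            out.append((a, text))
        elif b in versions:
            out.append((b, text))
    return out

print(text_of("L10"))
print(compare("L10", "P"))
```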
This application is suitable for collaborative refinement of texts within a small group of researchers or by a wider public, and attempts to extend the idea of the wiki to new classes of text and to new classes of user.

References

Barbera, M. (2007) The HyperLearning Project: Towards a Distributed and Semantically Structured e-research and e-learning Platform. Literary and Linguistic Computing, 21(1), 77-82.

Buzzetti, D. (2002) Digital Representation and the Text Model. New Literary History 33(1), 61-88.

Dahlström, M. (2000) Drowning by Versions. Human IT 4. Retrieved 16/11/07 from http://www.hb.se/bhs/ith/4-00/md.htm

Renear, A. (1997) Out of Praxis: Three (Meta) Theories of Textuality. In Electronic Text, Sutherland, K. (Ed.), Clarendon Press, Oxford, pp. 107-126.

Robinson, P. (2007) Electronic editions which we have made and which we want to make. In A. Ciula and F. Stella (Eds), Digital Philology and Medieval Texts, Pisa, Pacini, pp. 1-12.

Schmidt, D. and Fiormonte, D. (2007) Multi-Version Documents: a Digitisation Solution for Textual Cultural Heritage Artefacts. In G. Bordoni (Ed.), Atti di AI*IA Workshop for Cultural Heritage, 10th Congress of Italian Association for Artificial Intelligence, Università di Roma Tor Vergata, Villa Mondragone, 10 Sept., pp. 9-16.

Segre, C. (1981) Testo. In Enciclopedia Einaudi Vol. 14. Einaudi, Torino, pp. 269-291.

Shillingsburg, P. (2006) From Gutenberg to Google. Cambridge University Press, Cambridge.

Tanselle, G. (2006) Foreword. In Burnard, L., O’Brien O’Keeffe, K. and Unsworth, J. (Eds.) Electronic Textual Editing. Text Encoding Initiative, New York and London.

Tapscott, D. and Williams, A. (2006) Wikinomics: How Mass Collaboration Changes Everything. Portfolio, New York.
Determining Value for Digital Humanities Tools

Susan Schreibman sschreib@umd.edu University of Maryland, USA
Ann Hanlon ahanlon@umd.edu University of Maryland, USA

Tool development in the Digital Humanities has been the subject of numerous articles and conference presentations (Arts and Humanities Research Council (AHRC), 2006; Bradley, 2003; McCarty, 2005; McGann, 2005; Ramsay, 2003, 2005; Schreibman, Hanlon, Daugherty, Ross, 2007; Schreibman, Kumar, McDonald, 2003; Summit on Digital Tools for the Humanities, 2005; Unsworth, 2003). While the purpose and direction of tools and tool development for the Digital Humanities has been discussed and debated in those forums, the value of tool development itself has seen little discussion. This is in part because tools are developed to aid and abet scholarship – they are not necessarily considered scholarship themselves. That perception may be changing, though. The clearest example of such a shift in thinking came from the recent recommendations of the ACLS Commission on Cyberinfrastructure for the Humanities and Social Sciences, which called not only for “policies for tenure and promotion that recognize and reward digital scholarship and scholarly communication” but likewise stated that “recognition should be given not only to scholarship that uses the humanities and social science cyberinfrastructure but also to scholarship that contributes to its design, construction and growth.” On the other hand, the MLA Report on Evaluating Scholarship for Tenure and Promotion found that a majority of departments have little to no experience evaluating refereed articles and monographs in electronic format. The prospects for evaluating tool development as scholarship in those departments would appear dim. However, coupled with the more optimistic recommendations of the ACLS report, as well as the MLA Report’s findings that evaluation of work in digital form is gaining ground in some departments, the notion of tool development as a scholarly activity may not be far behind.

In 2005, scholars from the humanities as well as the social sciences and computer science met in Charlottesville, Virginia for a Summit on Digital Tools for the Humanities. While the summit itself focused primarily on the use of digital resources and digital tools for scholarship, the Report on Summit Accomplishments that followed touched on development, concluding that “the development of tools for the interpretation of digital evidence is itself research in the arts and humanities.” The goal of this paper is to demonstrate how the process of software or tool development itself can be considered the scholarly activity, and not solely a means to an end, i.e. a feature or interface for a content-based digital archive or repository. This paper will also deal with notions of value: both within the development community and as developers perceive how their home institutions and the community for which the software was developed value their work.

The data for this paper will be drawn from two sources. The first source is a survey carried out by the authors in 2007 on The Versioning Machine, the results of which were shared at a poster session at the 2007 Digital Humanities Conference. The results of this survey were intriguing, particularly in the area of the value of The Versioning Machine as a tool, in which the vast majority of respondents found it valuable as a means to advance scholarship in spite of the fact that they themselves did not use it. As a result of feedback by the community to that poster session, the authors decided to conduct a focused survey on tool development as a scholarly activity. After taking advice from several prominent Digital Humanities tool developers, the survey has been completed and will be issued in December to gather information on how faculty, programmers, web developers, and others working in the Digital Humanities perceive the value and purpose of their software and digital tool development activities. This paper will report the findings of a web-based survey that investigates how tools have been conceived, executed, received and used by the community. Additionally, the survey will investigate developers’ perceptions of how tool development is valued in the academic community, both as a contribution to scholarship in their particular fields, and in relation to requirements for tenure and promotion.

The authors will use these surveys to take up John Unsworth’s challenge, made at the 2007 Digital Humanities Centers Summit, “to make our difficulties, the shortcomings of our tools, the challenges we haven’t yet overcome, something that we actually talk about, analyze, and explicitly learn from.” By examining not only how developers perceive their work, but how practitioners in the field use and value their tools, we intend to illuminate the role of tool development in the Digital Humanities and in the larger world of academia.
Bibliography American Council of Learned Societies (ACLS), Our Cultural Commonwealth:The Final Report of the American Council of Learned Societies Commission on Cyberinfrastructure for the Humanities & Social Sciences, 2006, <http://www.acls.org/ cyberinfrastructure/OurCulturalCommonwealth.pdf> Arts and Humanities Research Council (AHRC), AHRC ICT Methods Network Workgroup on Digital Tools Development for the Arts and Humanities, 2006, <http://www.methnet.ac.uk/ redist/pdf/wg1report.pdf> Bradley, John. “Finding a Middle Ground between ‘Determinism’ and ‘Aesthetic Indeterminacy’: a Model for Text Analysis Tools.” Literary & Linguistic Computing 18.2 (2003): 185-207. _____________________________________________________________________________ 189 Digital Humanities 2008 _____________________________________________________________________________ Kenny, Anthony. “Keynote Address: Technology and Humanities Research.” In Scholarship and Technology in the Humanities: Proceedings of a Conference held at Elvetham Hall, Hampshire, UK, 9th-12th May 1990, ed. May Katzen (London: Bowker Saur, 1991), 1-10. McCarty, Willard. Humanities Computing. New York: Palgrave Macmillan, 2005. McGann, Jerome. “Culture and Technology: The Way We Live Now, What Is to Be Done?” New Literary History 36.1 (2005): 71-82. Modern Language Association of America, Report of the MLA Task Force on Evaluating Scholarship for Tenure and Promotion, December 2006, <http://www.mla.org/tenure_promotion_ pdf> Ramsay, Stephen. “In Praise of Pattern.” TEXT Technology 14.2 (2005): 177-190. Ramsay, Stephen.“Reconceiving Text Analysis: Toward an Algorithmic Criticism.” Literary and Linguistic Computing 18.2 (2003): 167-174. Schreibman, Susan and Ann Hanlon, Sean Daugherty, Tony Ross. “The Versioning Machine 3.1: Lessons in Open Source [Re]Development”. Poster session at Digital Humanities (Urbana Champaign, June 2007) Schreibman, Susan, Amit Kumar and Jarom McDonald. “The Versioning Machine”. Proceedings of the 2002 ACH/ALLC Conference. Literary and Linguistic Computing 18.1 (2003) 101107 Summit on Digital Tools for the Humanities (September 28-30, 2005), Report on Summit Accomplishments, May 2006. <http://www.iath.virginia.edu/dtsummit/SummitText.pdf> Unsworth, John. “Tool-Time, or ‘Haven’t We Been Here Already?’: Ten Years in Humanities Computing”, Delivered as part of Transforming Disciplines: The Humanities and Computer Science, Saturday, January 18, 2003, Washington, D.C. <http://www.iath.virginia.edu/~jmu2m/carnegie-ninch.03. html> Unsworth, John. “Digital Humanities Centers as Cyberinfrastructure”, Digital Humanities Centers Summit, National Endowment for the Humanities, Washington, DC, Thursday, April 12, 2007 <http://www3.isrl.uiuc.edu/ ~unsworth/dhcs.html> Recent work in the EDUCE Project W. Brent Seales seales@netlab.uky.edu University of Kentucky, USA A. Ross Scaife scaife@gmail.com University of Kentucky, USA Popular methods for cultural heritage digitization include flatbed scanning and high-resolution photography. The current paradigm for the digitization and dissemination of library collections is to create high-resolution digital images as facsimiles of primary source materials. While these approaches have provided new levels of accessibility to many cultural artifacts, they usually assume that the object is more or less flat, or at least viewable in two dimensions. However, this assumption is simply not true for many cultural heritage objects. 
For several years, researchers have been exploring digitization of documents beyond 2D imaging. Three-dimensional surface acquisition technology has been used to capture the shape of non-planar texts and build 3D document models (Brown 2000, Landon 2006), where structured light techniques are used to acquire 3D surface geometry and a high-resolution still camera is used to capture a 2D texture image. However, there are many documents that are impossible to scan or photograph in the usual way. Take for example the entirety of the British Library’s Cotton Collection, which was damaged by a fire in 1731 (Tite 1994, Prescott 1997, Seales 2000, Seales 2004). Following the fire most of the manuscripts in this collection suffered from both fire and water damage, and had to be physically dismantled and then painstakingly reassembled. Another example is the collection of papyrus scrolls that are in the Egyptian papyrus collection at the British Museum (Smith 1987, Andrews 1990, Lapp 1997). Because of damage to the outer shells of some of these scrolls the text enclosed therein have never been read - unrolling the fragile material would destroy them. Certain objects can be physically repaired prior to digitization. However, the fact is that many items are simply too fragile to sustain physical restoration. As long as physical restoration is the only option for opening such opaque documents, scholarship has to sacrifice for preservation, or preservation sacrifice for scholarship. The famous Herculaneum collection is an outstanding example of the scholarship/preservation dichotomy (Sider 2005). In the 1750s, the excavation of more than a thousand scorched papyrus rolls from the Villa dei Papiri (the Villa of the Papyri) in ancient Herculaneum caused great excitement among contemporary scholars, for they held the possibility _____________________________________________________________________________ 190 Digital Humanities 2008 _____________________________________________________________________________ of the rediscovery of lost masterpieces by classical writers. However, as a result of the eruption of Mt. Vesuvius in A.D. 79 that destroyed Pompeii and also buried nearby Herculaneum, the papyrus rolls were charred severely and are now extremely brittle, frustrating attempts to open them. A number of approaches have been devised to physically open the rolls, varying from mechanical and “caloric”, to chemical (Sider 2005). Substrate breakages caused during the opening create incomplete letters that appear in several separate segments, which makes reading them very difficult. Some efforts have been made to reconstruct the scrolls, including tasks such as establishing the relative order of fragments and assigning to them an absolute sequence (Janko 2003). In addition, multispectral analysis and conventional image processing methods have helped to reveal significant previously unknown texts. All in all, although these attempts have introduced some new works to the canon, they have done so at the expense of the physical objects holding the text. Given the huge amount of labor and care it takes to physically unroll the scrolls, together with the risk of destruction caused by the unrolling, a technology capable of producing a readable image of a rolled-up text without the need to physically open it is an attractive concept. Virtual unrolling would offer an obvious and substantial payoff. In summary, there is a class of objects that are inaccessible due to their physical construction. 
Many of these objects may carry precise contents which will remain a mystery unless and until they are opened. In most cases, physical restoration is not an option because it is too risky, unpredictable, and labor intensive. This dilemma is well suited for advanced computer vision techniques to provide a safe and efficient solution.The EDUCE project (Enhanced Digital Unwrapping for Conservation and Exploration) is developing a general restoration approach that enables access to those impenetrable objects without the need to open them. The vision is to apply this work ultimately to documents such as those described above, and to allow complete analysis while enforcing continued physical preservation. Proof of Concept With the assistance of curators from the Special Collections Library at the University of Michigan, we were given access to a manuscript from the 15th century that had been dismantled and used in the binding of a printed book soon after its creation. The manuscript is located in the spine of the binding, and consists of seven or so layers that were stuck together, as shown in figure 1. The handwritten text on the top layer is recognizable from the book of Ecclesiastes.The two columns of texts correspond to Eccl 2:4/2:5 (2:4 word 5 through 2:5 word 6) and Eccl 2:10 (word 10.5 through word 16). However, it was not clear what writing appears on the inner layers, or whether they contain any writing at all.We tested this manuscript using methods that we had refined over a series of simulations and real-world experiments, but this experiment was our first on a bona fide primary source. Figure 1: Spine from a binding made of a fifteenth-century manuscript. Following procedures which will be discussed in more detail in our presentation, we were able to bring out several layers of text, including the text on the back of the top layer. Figure 2 shows the result generated by our method to reveal the back side of the top layer which is glued inside and inaccessible.The left and right columns were identified as Eccl. 1:16 and 1:11 respectively. To verify our findings, conservation specialists at the University of Michigan uncovered the back side of the top layer by removing the first layer from the rest of the strip of manuscript. The process was painstaking, in order to minimize damage, and it took an entire day. First, the strip was soaked in water for a couple of hours to dissolve the glue and enhance the flexibility of the material which was fragile due to age; this added a risk of the ink dissolving, although the duration and water temperature were controlled to protect against this happening. Then, the first layer was carefully pulled apart from the rest of the manuscript with tweezers. The process was very slow to avoid tearing the material. Remaining residue was scraped off gently. The back side of the top layer is shown in figure 3. Most of the Hebrew characters in the images are legible and align well with those in the digital images of the manuscript. The middle rows show better results than the rows on the edges. That is because the edge areas were damaged in structure, torn and abraded, and that degraded the quality of the restoration. Figure 2: Generated result showing the back side of the top layer, identified as Eccl. 1:16 and 1:11. _____________________________________________________________________________ 191 Digital Humanities 2008 _____________________________________________________________________________ Lin,Y. and Seales, W. B. 
Figure 3: Photo of the back side of the top layer, once removed from the binding.

Without applying the virtual unwrapping approach, the choices would be either to preserve the manuscript with the hidden text unknown or to destroy it to read it. In this case, we first read the text with non-invasive methods and then disassembled the artifact in order to confirm our readings.

Presentation of Ongoing Work

In November 2007, representatives from the Sorbonne in Paris will bring several unopened papyrus fragments to the University of Kentucky to undergo testing following similar procedures to those that resulted in the uncovering of the Ecclesiastes text in the fifteenth-century manuscript. And in June 2008, a group from the University of Kentucky will be at the British Museum scanning and virtually unrolling examples from their collections of papyrus scrolls. This presentation at Digital Humanities 2008 will serve not only as an overview of the techniques that led to the successful test described above, but will also be an extremely up-to-date report of the most recent work of the project. We look forward to breaking down the dichotomy between preservation and scholarship for this particular class of delicate objects.

Bibliography

Andrews, C.A.R. Catalogue of Demotic Papyri in the British Museum, IV: Ptolemaic Legal Texts from the Theban Area. London: British Museum, 1990.

Brown, M. S. and Seales, W. B. “Beyond 2d images: Effective 3d imaging for library materials.” In Proceedings of the 5th ACM Conference on Digital Libraries (June 2000): 27-36.

Brown, M. S. and Seales, W. B. “Image Restoration of Arbitrarily Warped Documents.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 26:10 (October 2004): 1295-1306.

Janko, R. Philodemus On Poems Book One. Oxford University Press, 2003.

Landon, G.V. and Seales, W. B. “Petroglyph digitization: enabling cultural heritage scholarship.” Machine Vision and Applications, 17:6 (December 2006): 361-371.

Lapp, G. The Papyrus of Nu. Catalogue of Books of the Dead in the British Museum, vol. I. London: British Museum, 1997.

Lin, Y. and Seales, W. B. “Opaque Document Imaging: Building Images of Inaccessible Texts.” Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV’05).

Prescott, A. “’Their Present Miserable State of Cremation’: the Restoration of the Cotton Library.” Sir Robert Cotton as Collector: Essays on an Early Stuart Courtier and His Legacy. Editor C.J. Wright. London: British Library Publications, 1997. 391-454.

Seales, W. B., Griffoen, J., Kiernan, K., Yuan, C. J., Cantara, L. “The digital atheneum: New technologies for restoring and preserving old documents.” Computers in Libraries 20:2 (February 2000): 26-30.

Seales, W. B. and Lin, Y. “Digital restoration using volumetric scanning.” In Proceedings of the Fourth ACM/IEEE-CS Joint Conference on Digital Libraries (June 2004): 117-124.

Sider, D. The Library of the Villa dei Papiri at Herculaneum. Getty Trust Publications: J. Paul Getty Museum, 2005.

Smith, M. Catalogue of Demotic Papyri in the British Museum, III: the Mortuary Texts of Papyrus BM 10507. London: British Museum, 1987.

Tite, C. G. The Manuscript Library of Sir Robert Cotton. London: British Library, 1994.
“It’s a team if you use ‘reply all’”: An Exploration of Research Teams in Digital Humanities Environments

Lynne Siemens
siemensl@uvic.ca
University of Victoria, Canada

Introduction

This proposal reports on a research project exploring the nature of research teams in the Digital Humanities community. It is a start to understanding the types of supports and research preparation that individuals within the field require to successfully collaborate within research teams.

Context

Traditionally, research contributions in the humanities field have been felt to be, and documented to be, predominantly solo efforts by academics involving little direct collaboration with others, a model reinforced through doctoral studies and beyond (see, for example, Cuneo, 2003; Newell & Swan, 2000). However, Humanities Computing/Digital Humanities is an exception to this. Given that the nature of the research work involves computers and a variety of skills and expertise, Digital Humanities researchers are working collaboratively within their institutions and with others nationally and internationally to undertake the research. This research typically involves the need to coordinate efforts between academics, undergraduate and graduate students, research assistants, computer programmers, librarians, and other individuals, as well as the need to coordinate financial and other resources. Despite this, there has been little formal research on team development within this community.

That said, efforts toward understanding the organizational context in which Digital Humanities research is situated are beginning in earnest. Two large-scale survey projects (Siemens et al., 2002; Toms et al., 2004) have highlighted issues of collaboration, among other topics, and Warwick (2004) found that the organizational context has had an impact on the manner in which Digital Humanities/Humanities Computing centres developed in the United States and England. Other studies are underway as well. In addition, McCarty (2005b) explores the ways that computers have opened the opportunities for collaboration within the humanities and has explored the associated challenges of collaboration and team research within the HUMANIST listserve (2005a). Finally, through efforts such as the University of Victoria’s Digital Humanities/Humanities Computing Summer Institute and other similar ventures, the community is working to develop its collaborative capacity through workshops in topics like community-specific project management skills, which also include discussion of team development and support. This study draws upon these efforts as it explores and documents the nature of research teams within the Digital Humanities community, with the aim of identifying exemplary work patterns and larger models of research collaboration that have the potential to strengthen this positive aspect of the community even further.

Methods

This project uses a qualitative research approach with in-depth interviews with members of various multi-disciplinary, multi-location project teams in Canada, the United States, and the United Kingdom. The interview questions focus on the participants’ definitions of teams; their experiences working in teams; and the types of supports and research preparation required to ensure effective and efficient research results.
The results will include a description of the community’s work patterns and relationships and the identification of supports and research preparation required to sustain research teams (as per Marshall & Rossman, 1999; McCracken, 1988).

Preliminary Findings

At the time of writing this proposal, final data analysis is being completed, but clear patterns are emerging and, after final analysis, these will form the basis of my presentation. The individuals interviewed currently are, and have been, part of a diverse set of team research projects, in terms of research objective, team membership size, budget, and geographical dispersion, within their own institutions, nationally, and internationally. The roles they play are varied and include research assistant, researcher, computer programmer, and lead investigator. There are several commonalities among these individuals in terms of their skill development in team research and their definitions of research teams and communities. When final data analysis is complete, a series of exemplary patterns and models of research collaboration will be identified and outlined. These patterns and models will include the identification of supports and research preparation which can sustain research teams in the present and into the future.

The benefits to the Digital Humanities community will be several. First, the study contributes to an explicit description of the community’s work patterns and relationships. Second, it also builds on previous efforts to understand the organizational context in which Digital Humanities/Humanities Computing centres operate in order to place focus on the role of the individual and teams in research success. Finally, it identifies possible supports and research preparation to aid the further development of successful research teams.

References

Cuneo, C. (2003, November). Interdisciplinary teams - let’s make them work. University Affairs, 18-21.

Marshall, C., & Rossman, G. B. (1999). Designing qualitative research (3rd ed.). Thousand Oaks, California: SAGE Publications.

McCarty, W. (2005a). 19.215 how far collaboration? Humanist discussion group. Retrieved August 18, 2005, from http://lists.village.virginia.edu/lists_archive/Humanist/v19/0211.html

McCarty, W. (2005b). Humanities computing. New York, NY: Palgrave MacMillan.

McCracken, G. (1988). The long interview (Vol. 13). Newbury Park, CA: SAGE Publications.

Newell, S., & Swan, J. (2000). Trust and inter-organizational networking. Human Relations, 53(10), 1287-1328.

Siemens, R. G., Best, M., Grove-White, E., Burk, A., Kerr, J., Pope, A., et al. (2002). The credibility of electronic publishing: A report to the Humanities and Social Sciences Federation of Canada. Text Technology, 11(1), 1-128.

Toms, E., Rockwell, G., Sinclair, S., Siemens, R. G., & Siemens, L. (2004). The humanities scholar in the twenty-first century: How research is done and what support is needed, Joint International Conference of the Association for Computers and the Humanities and the Association for Literary & Linguistic Computing. Göteborg, Sweden. Published results forthcoming; results reported by Siemens, et al., in abstract available at http://www.hum.gu.se/allcach2004/AP/html/prop139.html.
Warwick, C. (2004). “No such thing as humanities computing?” an analytical history of digital resource creation and computing in the humanities, Joint International Conference of the Association for Computers and the Humanities and the Association for Literary & Linguistic Computing. Göteborg, Sweden. (Prepublication document available at http://tapor.humanities.mcmaster.ca/html/Nosuchthing_1.pdf.)

Doing Digital Scholarship

Lisa Spiro
lspiro@rice.edu
Rice University, USA

When I completed my dissertation Bachelors of Arts: Bachelorhood and the Construction of Literary Identity in Antebellum America in 2002, I figured that I was ahead of most of my peers in my use of digital resources, but I made no pretense of doing digital scholarship. I plumbed electronic text collections such as Making of America and Early American Fiction for references to bachelorhood, and I used simple text analysis tools to count the number of times words such as “bachelor” appeared in key texts. I even built an online critical edition of a section from Reveries of a Bachelor (http://etext.virginia.edu/users/spiro/Contents2.html), one of the central texts of sentimental bachelorhood. But in my determination to finish my PhD before gathering too many more gray hairs, I resisted the impulse to use more sophisticated analytical tools or to publish my dissertation online.

Five years later, the possibilities for digital scholarship in the humanities have grown. Projects such as TAPOR, Token-X, and MONK are constructing sophisticated tools for text analysis and visualization. Massive text digitization projects such as Google Books and the Open Content Alliance are making it possible to search thousands (if not millions) of books. NINES and other initiatives are building communities of digital humanities scholars, portals to content, and mechanisms for conducting peer review of digital scholarship. To encourage digital scholarship, the NEH recently launched a funding program. Meanwhile, scholars are blogging, putting up videos on YouTube, and using Web 2.0 tools to collaborate.

Despite this growth in digital scholarship, there are still too few examples of innovative projects that employ “digital collections and analytical tools to generate new intellectual products” (ACLS 7). As reports such as A Kaleidoscope of American Literature and Our Cultural Commonwealth suggest, the paucity of digital scholarship results from the lack of appropriate tools, technical skills, funding, and recognition. In a study of Dickinson, Whitman and Uncle Tom’s Cabin scholars, my colleague Jane Segal and I found that although scholars are increasingly using digital resources in their research, they are essentially employing them to make traditional research practices more efficient and gain access to more resources, not (yet) to transform their research methodology by employing new tools and processes (http://library.rice.edu/services/digital_media_center/projects/the-impact-of-digital-resources-on-humanities-research). What does it mean to do humanities research in a Web 2.0 world? To what extent do existing tools, resources, and research methods support digital scholarship, and what else do scholars need? To investigate these questions, I am revisiting my dissertation to re-imagine and re-mix it as digital scholarship.
I aim not only to open up new insights into my primary research area--the significance of bachelorhood in nineteenth-century American culture--but also to document and analyze emerging methods for conducting research in the digital environment. To what extent do digital tools and resources enable new approaches to traditional research questions—and to what extent are entirely new research questions and methods enabled? I am structuring my research around what John Unsworth calls the “scholarly primitives,” or core research practices in the humanities:

1. Discovering: To determine how much information is available online, I have searched for the nearly 300 resources cited in my dissertation in Google Books, Making of America, and other web sites. I found that 77% of my primary source resources and 22% of my secondary sources are available online as full-text, while 92% of all my research materials have been digitized (this number includes works available through Google Books as limited preview, snippet view, and no preview). Although most nineteenth-century books cited in my dissertation are now freely available online, many archival resources and periodicals have not yet been digitized.

2. Annotating: In the past, I kept research notes in long, unwieldy Word documents, which made it hard to find information that I needed. New software such as Zotero enables researchers to store copies of digital resources and to make annotations as part of the metadata record. What effect does the ability to share and annotate resources have on research practices? How useful is tagging as a mechanism for organizing information?

3. Comparing: Through text analysis and collation software such as Juxta and TAPOR, scholars can compare different versions of texts and detect patterns. Likewise, the Virtual Lightbox allows researchers to compare and manipulate digital images. What kind of new insights can be generated by using these tools? In the course of doing my research, I am testing freely available tools and evaluating their usefulness for my project.

4. Referring: With digital publications, we not only can refer to prior work, but link to it, even embed it. What is the best means of constructing a scholarly apparatus in digital scholarship, particularly in a work focused not only on making an argument, but also on examining the process that shaped that argument?

5. Sampling: With so much information available, what criteria should we use to determine what to focus on? Since not everything is digitized and search engines can be blunt instruments, what do we ignore by relying mainly on digital resources? In my blog, I am reflecting on the selection criteria used to produce the arguments in my revamped dissertation.

6. Illustrating: What kind of evidence do we use to build an argument in a work of digital scholarship, and how is that evidence presented? In my dissertation, I generalized about the significance of bachelorhood in American literature by performing close readings of a few key texts, but such a method was admittedly unsystematic. By using text analysis tools to study a much larger sample of primary texts, I can cite statistics such as word frequency in making my argument--but does this make my argument any more convincing?
7. Representing: How should a work of digital scholarship be presented? Ideally readers would be able to examine the evidence for themselves and even perform their own queries. At the same time, information should be offered so that it is clear and consistent with familiar academic discourse. How should I make available not only research conclusions, but also the detailed research process that undergirds these conclusions--the successful and unsuccessful searches, the queries run in text analysis software, the insights offered by collaborators? How will the digital work compare to the more traditional original dissertation? What kind of tools (for instance, ifBook’s Sophie) will be used to author the work?

In addition to Unsworth’s list, I offer two more:

8. Collaborating: Although humanities scholars are thought to be solitary, they collaborate frequently by exchanging bibliographic references and drafts of their essays. How do I engage the community in my research? I am encouraging others to comment on my (re-)work in progress (http://digitalhumanities.edublogs.org/) using Comment Press. Moreover, I am bookmarking all web-based sources for my study on delicious (http://del.icio.us/lms4w/digital_scholarship) and making available feeds from my various research sources through a PageFlakes portal (http://www.pageflakes.com/lspiro/). On my blog, “Digital Scholarship in the Humanities,” I explore issues and ideas raised by my research (http://digitalscholarship.wordpress.com/). I am examining what it takes to build an audience and how visibility and collaboration affect my research practices.

9. Remixing: What would it mean to take an earlier work--my own dissertation, for example--use new sources and approaches, and present it in a new form? What constitutes a scholarly remix, and what are the implications for intellectual property and academic ethics? I also plan to experiment with mashups as a means of generating and presenting new insights, such as a Google Map plotting census statistics about antebellum bachelors or a visual mashup of images of bachelors.

This project examines the process of doing research digitally, the capabilities and limits of existing tools and resources, and the best means of authoring, representing and disseminating digital scholarship. I aim to make this process as open, visible, and collaborative as possible. My presentation will focus on emerging research methodologies in the humanities, particularly the use of tools to analyze and organize information, the development of protocols for searching and selecting resources, and the dissemination of ideas through blogs and multimedia publication.

References

American Council of Learned Societies (ACLS). Our Cultural Commonwealth: The Report of the American Council of Learned Societies Commission on Cyberinfrastructure for the Humanities and Social Sciences. New York: American Council of Learned Societies, 2006.

Brogan, Martha. A Kaleidoscope of Digital American Literature. Digital Library Federation. 2005. 22 May 2007 <http://www.diglib.org/pubs/dlf104/>.

Unsworth, John. “Scholarly Primitives: what methods do humanities researchers have in common, and how might our tools reflect this?” 13 May 2000. 20 November 2007. <http://jefferson.village.virginia.edu/~jmu2m/Kings.5-00/primitives.html>.
Extracting author-specific expressions using random forest for use in the sociolinguistic analysis of political speeches

Takafumi Suzuki
qq16116@iii.u-tokyo.ac.jp
University of Tokyo, Japan

Introduction

This study applies stylistic text classification using random forest to extract author-specific expressions for use in the sociolinguistic analysis of political speeches. In the field of politics, the style of political leaders’ speeches, as well as their content, has attracted growing attention in both English (Ahren, 2005) and Japanese (Azuma, 2006; Suzuki and Kageura, 2006). One of the main purposes of these studies is to investigate political leaders’ individual and political styles by analyzing their speech styles. A common problem of many of these studies, and also of many sociolinguistic studies, is that the expressions analyzed are selected solely on the basis of the researcher’s interests or preferences, which can sometimes lead to contradictory interpretations. In other words, it is difficult to determine whether these kinds of analyses have in fact correctly identified political leaders’ individual speech styles and, on this basis, correctly characterised their individual and political styles. Another problem is that political leaders’ speech styles may also be characterised by the infrequent use of specific expressions, but this is rarely focused on. In order to solve these problems, we decided to apply stylistic text classification and feature extraction using random forest to political speeches. By classifying the texts of an author according to their style and extracting the variables contributing to this classification, we can identify the expressions specific to that author. This enables us to determine his/her speech style, including the infrequent use of specific expressions, and characterise his/her individual and political style. This method can be used for the sociolinguistic analysis of various types of texts, which is, according to Argamon et al. (2007a), a potentially important area of application for stylistics.

Experimental setup

We selected the Diet addresses of two Japanese prime ministers, Nakasone and Koizumi, and performed two classification experiments: we distinguished Nakasone’s addresses from those of his 3 contemporaries (1980-1989), and Koizumi’s addresses from those of his 8 contemporaries (1989-2006). Because Nakasone and Koizumi were two of the most powerful prime ministers in the history of Japanese politics, and took a special interest in the content or style of their own speeches (Suzuki and Kageura, 2007), their addresses are the most appropriate candidates for initial analysis. Suzuki and Kageura (2006) have demonstrated that the style of Japanese prime ministers’ addresses has changed significantly over time, so we compared their addresses with those of their respective contemporaries, selected by the standard division of eras in Japanese political history. We downloaded the addresses from the online database Sekai to Nihon (The World and Japan) (www.ioc.u-tokyo.ac.jp/~worldjpn), and applied morphological analysis to the addresses using ChaSen (Matsumoto et al., 2003). We unified notational differences which were distinguished only by kanji and kana in Japanese.
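The feature extraction step can be illustrated with a short sketch. The Python fragment below is not the authors’ code: the column layout of the ChaSen output, the function name, and the file handling are illustrative assumptions. It counts particles and auxiliary verbs in ChaSen-style morphological output and converts the counts into relative frequencies of the kind used as features here.

    from collections import Counter

    PARTICLE = "助詞"      # ChaSen/IPADIC tag for particles
    AUX_VERB = "助動詞"    # ChaSen/IPADIC tag for auxiliary verbs

    def feature_vector(chasen_output_path):
        # Assumes ChaSen's default tab-separated output with the base form
        # in column 3 and the part-of-speech string in column 4; adjust to
        # the actual output format if it differs.
        counts = Counter()
        total_tokens = 0
        with open(chasen_output_path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line or line == "EOS":
                    continue
                fields = line.split("\t")
                if len(fields) < 4:
                    continue
                base, pos = fields[2], fields[3]
                total_tokens += 1
                if pos.startswith(PARTICLE) or pos.startswith(AUX_VERB):
                    # keep the POS hierarchy with the base form, e.g. "助動詞:ます"
                    counts[pos + ":" + base] += 1
        return {k: v / total_tokens for k, v in counts.items()}

A matrix with one row per address and one column per particle or auxiliary-verb type can then be assembled from these vectors.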
Table 1 sets out the number of addresses and the total number of tokens and types for all words, particles and auxiliary verbs in each category.

Table 1. Basic data on the corpora

As a classification method, we selected the random forest (RF) method proposed by Breiman (2001), and evaluated the results using out-of-bag tests. RF is known to perform extremely well in classification tasks in other fields using large amounts of data, but to date few studies have used this method in the area of stylistics (Jin and Murakami, 2006). Our first aim in using RF was thus to test its effectiveness in stylistic text classification. A second and more important aim was to extract the important variables contributing to classification (indicated by high Gini coefficients). The extracted variables represent the specific expressions distinguishing the author from others, and they can show the author’s special preference for, or avoidance of, specific expressions. Examining these extracted expressions enables us to determine the author’s speech style and characterise his/her individual and political styles. In an analogous study, Argamon et al. (2007b) performed gender-based classification and feature extraction using SVM and information gain, but as those are separate experiments, whereas RF returns the relevant variables contributing to classification as part of the same procedure, RF is more suitable for our purposes.

We decided to focus on the distribution of particles and auxiliary verbs because they represent the modality of the texts, information representing authors’ personality and sentiments (Otsuka et al., 2007), and are regarded as clearly reflecting political leaders’ individual and political styles in Japanese (Azuma, 2006; Suzuki and Kageura, 2006). We tested 8 distribution combinations as features (see Table 2). Though Jin (1997) has demonstrated that the distribution of particles (1st order part-of-speech tag) is a good indicator of the author in Japanese, the performance of these features, especially auxiliary verbs, has not been explored fully, and as microscopic differences in features (order of part-of-speech and stemming) can affect classification accuracy, we decided to test the 8 combinations.
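As a minimal illustration of the classification and variable-ranking step just described (a sketch only: the abstract does not specify the authors’ implementation, and scikit-learn, the parameter values, and the variable names below are assumptions), a random forest with out-of-bag evaluation and Gini-based importances could be run as follows:

    from sklearn.ensemble import RandomForestClassifier

    def classify_and_rank(X, y, feature_names, n_trees=500, seed=0):
        # X: addresses-by-features matrix of relative frequencies
        # y: 1 for the target prime minister's addresses, 0 for contemporaries'
        rf = RandomForestClassifier(n_estimators=n_trees,
                                    oob_score=True,   # out-of-bag evaluation
                                    random_state=seed)
        rf.fit(X, y)
        # Gini (mean decrease in impurity) importance of each variable
        ranking = sorted(zip(feature_names, rf.feature_importances_),
                         key=lambda pair: pair[1], reverse=True)
        # OOB accuracy and the twenty most important variables
        return rf.oob_score_, ranking[:20]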
Results and discussion

Table 2 shows the precision, recall rates and F-values. Koizumi displayed higher accuracy than Nakasone, partly because he had a more individualistic style of speech (see also Figures 2 and 3), and partly because a larger number of texts were used in his case. Many of the feature sets show high classification accuracy (more than 70%) according to the criteria of an analogous study (Argamon et al., 2007c), which confirms the high performance of RF. The results also show that the distribution of the auxiliary verbs and the combinations can give better performance than that of the particles used in a previous study (Jin, 1997), and also that stemming and a deeper order of part-of-speech can improve the results.

Table 2. Precisions and recall rates

Figure 1 represents the top twenty variables with high Gini coefficients according to the classification on the combination of features (2nd order and with stemming). The figure indicates that several top variables had an especially important role in classification. In order to examine them in detail, we plotted in Figures 2 and 3 the transitions in the relative frequencies of the top four variables in the addresses of all prime ministers after World War 2. These include variables representing politeness (‘masu’, ‘aru’, ‘desu’), assertion (‘da’, ‘desu’), normativeness (‘beshi’), and intention (‘u’), and they show the individual and political styles of these two prime ministers well. For example, ‘aru’ is a typical expression used in formal speeches, including Diet addresses (Azuma, 2006), and the fact that Koizumi used this expression more infrequently than any other prime minister indicates his approachable speech style and can help explain his political success. The figures also show that we can extract the expressions that Nakasone and Koizumi used less frequently than their contemporaries as well as the expressions they used frequently. These results show the effectiveness of RF feature extraction for the sociolinguistic analysis of political speeches.

Figure 1. The top twenty variables with high Gini coefficients. The notations of the variables indicate the name of the part-of-speech (p: particle, av: auxiliary verb) followed by (in the case of particles) the 2nd order part-of-speech.

Figure 2. Transitions in the relative frequencies of the top four variables (Nakasone’s case) in the addresses of all Japanese prime ministers from 1945 to 2006. The ‘NY’s between the red lines represent addresses by Nakasone, and other initials represent addresses by the prime minister with those initials (see Suzuki and Kageura, 2007).

Figure 3. Transitions in the relative frequencies of the top four variables (Koizumi’s case) in the addresses of all Japanese prime ministers from 1945 to 2006. The ‘KJ’s between the red lines represent addresses by Koizumi, and the other initials represent addresses by the prime minister with those initials (see Suzuki and Kageura, 2007).

Conclusion

This study applied text classification and feature extraction using random forest to the sociolinguistic analysis of political speeches.
We showed that a relatively new method in stylistics performs fairly well, and enables us to extract author-specific expressions. In this way, we can systematically determine the expressions that should be analyzed to characterise authors’ individual and political styles. This method can be used for the sociolinguistic analysis of various types of texts, which will contribute to a further expansion in the scope of stylistics. A further study will include more concrete analysis of the extracted expressions.

References

Ahren, K. (2005): People in the State of the Union: viewing social change through the eyes of presidents, Proceedings of PACLIC 19, the 19th Asia-Pacific Conference on Language, Information and Computation, 43-50.

Argamon, S. et al. (2007a): Stylistic text classification using functional lexical features, Journal of the American Society for Information Science and Technology, 58(6), 802-822.

Argamon, S. et al. (2007b): Discourse, power, and Ecriture feminine: the text mining gender difference in 18th and 19th century French literature, Abstracts of Digital Humanities, 191.

Argamon, S. et al. (2007c): Gender, race, and nationality in Black drama, 1850-2000: mining differences in language use in authors and their characters, Abstracts of Digital Humanities, 149.

Azuma, S. (2006): Rekidai Syusyo no Gengoryoku wo Shindan Suru, Kenkyu-sya, Tokyo.

Breiman, L. (2001): Random forests, Machine Learning, 45, 5-23.

Jin, M. (1997): Determining the writer of diary based on distribution of particles, Mathematical Linguistics, 20(8), 357-367.

Jin, M. and M. Murakami (2006): Authorship identification with ensemble learning, Proceedings of IPSJ SIG Computers and the Humanities Symposium 2006, 137-144.

Matsumoto, Y. et al. (2003): Japanese Morphological Analysis System ChaSen ver.2.2.3, chasen.naist.jp.

Otsuka, H. et al. (2007): Iken Bunseki Enjin, Corona Publishing, Tokyo, 127-172.

Suzuki, T. and K. Kageura (2006): The stylistic changes of Japanese prime ministers’ addresses over time, Proceedings of IPSJ SIG Computers and the Humanities Symposium 2006, 145-152.

Suzuki, T. and K. Kageura (2007): Exploring the microscopic textual characteristics of Japanese prime ministers’ Diet addresses by measuring the quantity and diversity of nouns, Proceedings of PACLIC 21, the 21st Asia-Pacific Conference on Language, Information and Computation, 459-470.

Gentleman in Dickens: A Multivariate Stylometric Approach to its Collocation

Tomoji Tabata
tabata@lang.osaka-u.ac.jp
University of Osaka, Japan

This study proposes a multivariate approach to the collocation of gentleman in Dickens. By applying a stylo-statistical analysis model based on correspondence analysis (Tabata, 2004, 2005, 2007a, and 2007b) to the investigation of the collocation of the word gentleman, the present study visualizes the complex interrelationships among gentleman’s collocates, interrelationships among texts, and the association patterns between the collocates and texts in multidimensional spaces. By so doing, I shall illustrate how the collocational patterns of gentleman reflect stylistic variation over time as well as the stylistic fingerprint of the author.

One thing that strikes readers of Dickens in terms of his style is the way he combines words (i.e. collocation). Dickens sometimes combines words in a quite unique manner, in a way that contradicts our expectations, in order to strike humour, to imply irony or satire, or otherwise. Dickens also uses fixed/repetitive collocations for characterization: to make particular characters memorable by making them stand out from others. The collocation of the word gentleman, one of the most frequent ‘content’ words in Dickens, is full of such examples, and would therefore be worthy of stylistic investigation.

The significance of collocation in stylistic studies was first suggested by J. R. Firth (1957) more than half a century ago. Thanks to the efforts of Sinclair (1991) and his followers, who have developed empirical methodologies based on large-scale corpora, the study of collocation has become widely acknowledged as an important area of linguistics. However, the vast majority of collocation studies (Sinclair, 1991; Kjellmer, 1994; Stubbs, 1995; Hunston and Francis, 1999, to name but a few) have been concerned with grammar/syntax, lexicography, and language pedagogy, not with the stylistic aspects of the company a word keeps. Notable exceptions are Louw (1993), Adolphs and Carter (2003), Hori (2004), and Partington (2006). The emergence of these recent studies seems to indicate that collocation is beginning to draw increasing attention from stylisticians.

Major tools used for collocation studies are KWIC concordances and statistical measures, which have been designed to quantify collocational strength between words in texts.
With the wide availability of highly functional concordancers, it is now conventional practice to examine collocates in a span of, say, four words to the left and four words to the right of the node (target word/key word), with the help of statistics for filtering out unimportant information. This conventional approach makes it possible to detect a grammatical/syntactic, phraseological, and/or semantic relation between the node and its collocates. However, a conventional approach would not be suitable if one wanted to explore whether the company a word keeps remains unchanged, or whether its collocation changes over time or across texts. It is unsuitable because including a diachronic or a cross-textual/authorial perspective when retrieving collocational information from a large set of texts would involve too many variables to process with a concordancer. Such an enterprise would require a multivariate approach.

Various multivariate analyses of texts have been successful in elucidating linguistic variation over time, variation across registers, variation across oceans, to say nothing of linguistic differences between authors (Brainerd, 1980; Burrows, 1987 & 1996; Biber and Finegan, 1992; Craig, 1999a, b, & c; Hoover, 2003a, b, & c; Rudman, 2005). My earlier attempts used correspondence analysis to accommodate low-frequency variables (words) in profiling authorial/chronological/cross-register variations in Dickens and Smollett (Tabata, 2005, 2007a, & c). Given the fact that most collocates of content words tend to be low in frequency, my methodology based on correspondence analysis can usefully be applied to a macroscopic analysis of the collocation of gentleman.

This study uses Smollett’s texts as a control set against which the Dickens data is compared, in keeping with my earlier investigations. Dickens and Smollett stand in contrast in the frequency of gentleman. In the 23 Dickens texts used in this study, the number of tokens of gentleman amounts to 4,547, whereas Smollett employs the word 797 times in his seven works. On a normalised frequency scale per million words, the frequency in Dickens is 961.2, while the frequency in Smollett is 714.3. However, if one compares the first seven Dickens texts with the Smollett set, the discrepancy is even greater: 1792.0 versus 714.33 per million words. The word gentleman is significantly more frequent in early Dickens than in his later works.

Stubbs (2001: 29) states that “[t]here is some consensus, but no total agreement, that significant collocates are usually found within a span of 4:4”. Following the conventional practice for examining collocation (Sinclair, 1991; Stubbs, 1995 & 2001), the present study treats words occurring within a span of four words prior to, and four words following, the node (gentleman) as variables (collocates) to be fed into correspondence analysis. The frequency of each collocate is arrayed to form a frequency profile for each of 29 texts (Smollett’s The Adventure of an Atom is left out of the data set since it contains no instance of gentleman). The set of 29 collocate frequency profiles (the collocate frequency matrix) is then transposed and submitted to correspondence analysis (CA), a technique for data reduction. CA allows examination of the complex interrelationships between row cases (i.e., texts), interrelationships between column variables (i.e., collocates), and the association between the row cases and column variables graphically in a multidimensional space. It computes the row coordinates (word scores) and column coordinates (text scores) in a way that permutes the original data matrix so that the correlation between the word variables and text profiles is maximized. In a permuted data matrix, adverbs with a similar pattern of distribution make the closest neighbours, and so do texts of similar profile. When the row/column scores are projected in multi-dimensional charts, the relative distance between variable entries indicates affinity, similarity, association, or otherwise between them.
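The procedure just outlined can be sketched in a few lines of code. The fragment below is an illustration under simplifying assumptions (crude tokenization, no lemmatization, and a plain SVD-based correspondence analysis), not the author’s actual implementation: it builds collocate counts within a 4:4 span and computes two-dimensional row and column coordinates.

    import re
    import numpy as np
    from collections import Counter

    def collocate_counts(text, node="gentleman", span=4):
        # Count words occurring within `span` words on either side of the node.
        tokens = re.findall(r"[a-z]+", text.lower())
        counts = Counter()
        for i, token in enumerate(tokens):
            if token == node:
                window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
                counts.update(window)
        return counts

    def correspondence_analysis(F):
        # F: collocates-by-texts frequency matrix (no all-zero rows or columns).
        P = F / F.sum()                     # correspondence matrix
        r = P.sum(axis=1)                   # row (collocate) masses
        c = P.sum(axis=0)                   # column (text) masses
        S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
        U, sv, Vt = np.linalg.svd(S, full_matrices=False)
        word_scores = (U * sv) / np.sqrt(r)[:, None]    # row coordinates
        text_scores = (Vt.T * sv) / np.sqrt(c)[:, None]  # column coordinates
        return word_scores[:, :2], text_scores[:, :2]    # first two dimensions

Plotting the first two dimensions of the text scores and word scores yields text maps and word maps of the general kind discussed below.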
Figure 1. Correspondence analysis of the collocates of gentleman: Text-map

Figure 2. Correspondence analysis of the collocates of gentleman: A galaxy of collocates

Figures 1 and 2 demonstrate a result of correspondence analysis based on 1,074 collocates of gentleman across 29 texts. The horizontal axis of Figure 1, labelled Dimension 1, the most powerful axis, visualizes the difference between the two authors in the distribution of gentleman’s collocates. It is also interesting that the early Dickensian texts, written in the 1830s and early 1840s, lie towards the bottom half of the diagram. The same holds true for Smollett. Thus, the horizontal axis can be interpreted as indicating authorial variation in the collocation of gentleman, whereas the vertical axis can be interpreted as representing variation over time, although the text entries do not find themselves in exact chronological order. On the other hand, Figure 2 is too densely populated to identify each collocate, except for the outlying collocates, those with stronger “pulling power”. However, it is possible to make up for this by inspecting a diagram derived from a much smaller number of variables (say, 100 variables, which will be shown shortly as Figure 4), whose overall configuration is remarkably similar to Figure 2 despite the decrease in the number of variables computed.

The Dickens corpus is more than four times the size of the Smollett corpus, and the number of types as well as tokens of gentleman’s collocates in Dickens is more than four times as many as those in Smollett. It is necessary to ensure that a size factor does not come into play in the outcome of the analysis. Figures 3 and 4 are derived from the variables of 100 collocates common to both authors. Despite the decrease in the number of variables from 1,074 to 100, the configuration of texts and words is remarkably similar to that based on 1,074 items. The Dickens set and the Smollett set, once again, can be distinguished from each other along Dimension 1. Moreover, in each of the two authors’ sets, early works tend to have lower scores, with later works scoring higher along Dimension 2.

Figure 3. Correspondence analysis of the 100 most common collocates of gentleman: Text-map

Figure 4. Correspondence analysis of the 100 most common collocates of gentleman: Word-map of 100 collocates

The results of the present analysis are consistent with my earlier studies based on different variables, such as –ly adverbs, superlatives, and high-frequency function words. It would be possible to assume that authorial fingerprints are as firmly set in the collocation of gentleman as in other components of vocabulary. These results seem to illustrate that a multivariate analysis of collocates could provide interesting new perspectives on the study of collocation.

References
Adolphs, S. and R. A. Carter (2003) ‘Corpus stylistics: point of view and semantic prosodies in To The Lighthouse’, Poetica, 58: 7–20.

Biber, D. and E. Finegan (1992) ‘The Linguistic Evolution of Five Written and Speech-Based English Genres from the 17th to the 20th Centuries,’ in M. Rissanen (ed.) History of Englishes: New Methods and Interpretation in Historical Linguistics. Berlin/New York: Mouton de Gruyter. 668–704.

Brainerd, B. (1980) ‘The Chronology of Shakespeare’s Plays: A Statistical Study,’ Computers and the Humanities, 14: 221–230.

Burrows, J. F. (1987) Computation into Criticism: A study of Jane Austen’s novels and an experiment in method. Oxford: Clarendon Press.

Burrows, J. F. (1996) ‘Tiptoeing into the Infinite: Testing for Evidence of National Differences in the Language of English Narrative’, in S. Hockey and N. Ide (eds.) Research in Humanities Computing 4. Oxford/New York: Oxford UP. 1–33.

Craig, D. H. (1999a) ‘Johnsonian chronology and the styles of A Tale of a Tub,’ in M. Butler (ed.) Re-Presenting Ben Jonson: Text Performance, History. London: Macmillan, 210–232.

Craig, D. H. (1999b) ‘Authorial Attribution and Computational Stylistics: If You Can Tell Authors Apart, Have You Learned Anything About Them?’ Literary and Linguistic Computing, 14: 103–113.

Craig, D. H. (1999c) ‘Contrast and Change in the Idiolects of Ben Jonson Characters,’ Computers and the Humanities, 33.3: 221–240.

Firth, J. R. (1957) ‘Modes of Meaning’, in Papers in Linguistics 1934–51. London: OUP. 191–215.

Hoover, D. L. (2003a) ‘Frequent Collocations and Authorial Style,’ Literary and Linguistic Computing, 18: 261–286.

Hoover, D. L. (2003b) ‘Multivariate Analysis and the Study of Style Variation,’ Literary and Linguistic Computing, 18: 341–360.

Hori, M. (2004) Investigating Dickens’ Style: A Collocational Analysis. New York: Palgrave Macmillan.

Hunston, S. and G. Francis (1999) Pattern Grammar: A Corpus-Driven Approach to the Lexical Grammar of English. Amsterdam: John Benjamins.

Kjellmer, G. (1994) A Dictionary of English Collocations: Based on the Brown Corpus (3 vols.). Oxford: Clarendon Press.

Louw, W. (1993) ‘Irony in the text or insincerity in the writer? The diagnostic potential of semantic prosodies’, reprinted in G. Sampson and D. McCarthy (eds.) (2004) Corpus Linguistics: Readings in a Widening Discipline. London and New York: Continuum. 229–241.

Partington, A. (2006) The Linguistics of Laughter: A Corpus-Assisted Study of Laughter-Talk. London/New York: Routledge.

Rudman, J. (2005) ‘The Non-Traditional Case for the Authorship of the Twelve Disputed “Federalist” Papers: A Monument Built on Sand?’, ACH/ALLC 2005 Conference Abstracts, Humanities Computing and Media Centre, University of Victoria, Canada, 193–196.

Sinclair, J. (1991) Corpus, Concordance, Collocation. Oxford: OUP.

Stubbs, M. (1995) ‘Corpus evidence for norms of lexical collocation’, in G. Cook and B. Seidlhofer (eds.) Principle and Practice in Applied Linguistics: Studies in Honour of H. G. Widdowson. Oxford: OUP. 243–256.

Stubbs, M. (2001) Words and Phrases: Corpus studies of lexical semantics. Oxford: Blackwell.
Tabata, T. (2004) ‘Differentiation of Idiolects in Fictional Discourse: A Stylo-Statistical Approach to Dickens’s Artistry’, in R. Hiltunen and S. Watanabe (eds.) Approaches to Style and Discourse in English. Osaka: Osaka University Press. 79–106.

Tabata, T. (2005) ‘Profiling stylistic variations in Dickens and Smollett through correspondence analysis of low frequency words’, ACH/ALLC 2005 Conference Abstracts, Humanities Computing and Media Centre, University of Victoria, Canada, 229–232.

Tabata, T. (2007a) ‘A Statistical Study of Superlatives in Dickens and Smollett: A Case Study in Corpus Stylistics’, Digital Humanities 2007 Conference Abstracts, The 19th Joint International Conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing, University of Illinois, Urbana-Champaign, June 4–June 8, 2007, 210–214.

Tabata, T. (2007b) ‘The Cunningest, Rummest, Superlativest Old Fox: A multivariate approach to superlatives in Dickens and Smollett’, PALA 2007—Style and Communication—Conference Abstracts, The 2007 Annual Conference of the Poetics and Linguistics Association, Kansai Gaidai University, Hirakata, Osaka, Japan, 31 July–4 August 2007, 60–61.

Video Game Avatar: From Other to Self-Transcendence and Transformation

Mary L. Tripp
mtripp@mail.ucf.edu
University of Central Florida, USA

The aim of this paper is to establish a model for the relationship between the player of a video game and the avatar that represents that player in the virtual world of the game. I propose that there is an evolution in the identification of the player of a video game with the avatar character that performs embodied actions in the virtual world of the game. This identification can be described through an examination of theories from a variety of subject areas, including philosophy, literary studies, video game theory, and educational theory. Specifically, theories in hermeneutics, literary immersion, embodiment, empathy, narrative, game ego, play, and learning theory are synthesized to produce a broad picture of the player/avatar relationship as it develops over time. I will identify stages in the process of feeling immersed in a game. This feeling of immersion can, but does not necessarily, occur as the player engages in game play over a period of time. I will identify this process in stages: “other”; “embodied empathy”; “self-transcendence”; and “transformation.” At the final stage, the game play can offer a critique of the player’s worldview, calling into question the player’s approach and even presuppositions about the world. I suggest that, at this stage, the player no longer sees the avatar as an “other,” a character to be understood in some way, nor as some virtual representation of the self, something apart from himself, nor even as some sort of virtual embodiment of himself. I suggest that the player transcends even the avatar as any sort of representation, that the avatar disappears from the consciousness of the player altogether. A result of this critique is a transformation that takes place within the player, one that may actually transform the player’s self-understanding, thus providing the player with an authentic learning experience.

There is a sense of embodiment that goes along with controlling an avatar in a virtual world that, I believe, is not present when becoming immersed in a film or a book. In a video game, the player’s actions have direct and immediate consequences in terms of reward or punishment as well as movement through the virtual world.
The narrative effect helps drive the character through this reward and punishment. The idea of affordances is often used in terms of game design, with emotional affordances helping to propel the player into the virtual environment. Thus, with a video game, there is a more complete immersion into the constructed imaginary world: an embodied immersion and an emotional immersion.

I would like to clarify the definition of this feeling of immersion that happens as a video game progresses. Various terms are used in the fields of literary analysis, psychology, video game theory, and philosophy. Terms like literary transport, flow, presence, immersion, identification, and self-transcendence are often used to describe the feeling of the self moving out of the real world and into a constructed imaginary world. I use Janet Murray’s commonly accepted definition of this term in the gaming industry: a sense of being physically submerged in water (Murray 1997, p. 99). My goal here is to illuminate the idea that there is a process to gaining this feeling of immersion. The player proceeds step by step into the game world. The work of Gallagher in philosophy of mind acknowledges that there is a growth process that takes place in developing an understanding of other people, through primary and secondary interaction. Gallese offers an alternative argument that the biological basis of this growth in understanding happens initially as a neural response from the mirror neuron system, then as a more mature empathetic response. Ultimately, I would like to establish a sequence that combines the work of Gallese and Gallagher with the work of the digital narrative theorists Ryan and Murray. The sequence of immersion could be stated as: other → empathy → self-transcendence → transformation.

The first stage of approaching the avatar is learning the mechanics of the game. In this stage the player sees the avatar as “other,” as an awkward and theoretical representation of the self. Initially, the avatar is a picture or symbol for the self. The player learns to manipulate the keyboard, mouse, or controller. The avatar is “other,” a foreign virtual object that must be awkwardly manipulated by the player. In this stage there is little personal identification with the avatar, except on a theoretical level. The avatar cannot function efficiently according to the will of the player, but operates as a cumbersome vehicle for the player’s intentions.

I propose that in the second stage the player begins to view the avatar as a character in a world, and begins to empathize with the character much as one empathizes with a character in a novel or a film. Let me clarify that at this stage the avatar is still viewed as other, but a component of empathy now emerges through the use of narrative. The player has now learned to manipulate the character effectively, to move the character through the virtual world. I believe that embodiment plays an important biological role here in helping to establish this feeling of empathy for the character. There is also an important component of narrative that drives the player toward empathy with the avatar. Marie-Laure Ryan is an important theorist in this area.
In the third stage, which I call “self-transcendence,” the player experiences full identification with the avatar, not empathizing with another character, but embodying the actions and world of the avatar as if they were his own. On this point, I will refer to Gadamer’s idea of the self-transcendence of play, his article “The Relevance of the Beautiful,” and several authors working on the concepts of immersion and game ego in video game theory (Murray, Wilhelmsson, Ryan). The literature on video game theory uses the terms immersion, flow, and presence in a similar way, but I feel Gadamer’s term “self-transcendence” more aptly fits the description I will offer. At the point of self-transcendence, transformation can happen. Learning theory and the philosophy of hermeneutics, especially as stated in Gallagher’s Hermeneutics and Education and the hermeneutic theories of Gadamer (specifically his fusion of horizons), can establish how understanding happens and how transformation of the self can take place more completely in the virtually embodied world of video games.

View of the avatar character by the player of a video game as game play progresses.

References

Brna, Paul. “On the Role of Self-Esteem, Empathy and Narrative in the Development of Intelligent Learning Environments.” Pivec, Maja, ed. Affective and Emotional Aspects of Human-Computer Interaction: Game-Based and Innovative Learning Approaches. IOS Press: Washington, DC. 2006.

Dias, J., Paiva, A., Vala, M., Aylett, R., Woods, S., Zoll, C., and Hall, L. “Empathic Characters in Computer-Based Personal and Social Education.” Pivec, Maja, ed. Affective and Emotional Aspects of Human-Computer Interaction: Game-Based and Innovative Learning Approaches. IOS Press: Washington, DC. 2006.

Gadamer, H.G. Truth and Method. Weinsheimer and Marshall, trans. New York: Crossroad, 1989.

Gallagher, Shaun. Hermeneutics and Education. Albany: State University of New York Press, 1992.

Gallagher, Shaun. How the Body Shapes the Mind. Oxford University Press: New York, 2005.

Gallagher, S. and Hutto, D. (in press, 2007). Understanding others through primary interaction and narrative practice. In: Zlatev, Racine, Sinha and Itkonen (eds). The Shared Mind: Perspectives on Intersubjectivity. Amsterdam: John Benjamins.

Gallese, V. “The ‘Shared Manifold’ Hypothesis: From Mirror Neurons to Empathy.” Journal of Consciousness Studies. 8: 33-50, 2001.

Goldman, A. “Empathy, Mind, and Morals.” In Mental Simulation. Ed. M. Davies and T. Stone, 185-208. Oxford: Blackwell, 1995.

Grodal, Torben. “Stories for Eye, Ear, and Muscles: Video Games, Media, and Embodied Experiences.” The Video Game Theory Reader. Mark J. Wolf and Bernard Perron, eds. Routledge: New York, 2003.

Johnson, Mark (1987) The Body in the Mind: The Bodily Basis of Meaning, Imagination and Reason. Chicago: Chicago University Press, 1987 and 1990.

Lahti, Martti. “As We Become Machines: Corporealized Pleasures in Video Games.” The Video Game Theory Reader. Mark J. Wolf and Bernard Perron, eds. Routledge: New York, 2003.

McMahan, Alison. (2003). “Immersion, Engagement, and Presence: A Method for Analysing 3-D Video Games.” The Video Game Theory Reader. Mark J. Wolf and Bernard Perron, eds. Routledge: New York.

Murray, Janet. Hamlet on the Holodeck: The Future of Narrative in Cyberspace. Cambridge: MIT Press, 1997.
Ryan, Marie-Laure. Narrative as Virtual Reality: Immersion and Interactivity in Literature and Electronic Media. Baltimore: Johns Hopkins University Press, 2001.

Sykes, Jonathan. (2006). “Affective Gaming: Advancing the Argument for Game-Based Learning.” Pivec, Maja, ed. Affective and Emotional Aspects of Human-Computer Interaction: Game-Based and Innovative Learning Approaches. IOS Press: Washington, DC.

White, H. (1981). “Narrative and History.” In W.J.T. Mitchell, ed., On Narrative. Chicago: University of Chicago Press.

Wilhelmsson, Ulf. (2006). “What is a Game Ego? (or How the Embodied Mind Plays a Role in Computer Game Environments).” Pivec, Maja, ed. Affective and Emotional Aspects of Human-Computer Interaction: Game-Based and Innovative Learning Approaches. IOS Press: Washington, DC.

Wittgenstein, Ludwig. Philosophical Investigations. Trans. G.E.M. Anscombe. Blackwell Publishing: Cambridge, 1972.

Normalizing Identity: The Role of Blogging Software in Creating Digital Identity

Kirsten Carol Uszkalo
kuszkalo@stfx.ca
St. Francis Xavier University, Canada

Darren James Harkness
webmaster@staticred.net
University of Alberta, Canada

We offer a new, in-depth methodology for looking at how the use of blogging software delineates and normalizes the blogger’s creation of posts and, by extension, the creation of her blog self. The simple choice of software is not simple at all; it has a great influence on the shape a blogger’s identity will take through the interface, program design, and data structures imposed on her by the software. This primarily technical discussion of a topic seldom considered in studies of the cultural impact of blogging will illuminate the inner workings of the medium and give due credence to Marshall McLuhan’s argument that “the ‘message’ of any medium or technology is the change of scale or pace or pattern that it introduces into human affairs”. Technology plays an integral part in distinguishing the blog as a new medium, apart from that of written (paper-based) text; as a result, a study of the blog that does not investigate its infrastructure is incomplete. Critics of blogging scholarship point out the lack of technical discussion around the software used by bloggers as a weakness (Scheidt 2005, Lawley 2004). The criticism is valid; critics’ attention has been focused on the output of blogs, categorizing them as an extension of existing genres, whether that be the diary (Rak 167) or the newspaper (Koman). This study serves as a response to that criticism, and aims to start the discussion by looking into the dark recesses of the software, databases, and code to illustrate just how influential infrastructure is in defining identity.

Programmers do not care about the content of any given blog. The people who develop Movable Type, Blogger, LiveJournal, and Wordpress are developing software which helps make blogging a much simpler process, and they do listen to customer requests for features. But the developer is not concerned whether your blog will be an online journal, a political commentary, or a collection of cat pictures – what she is concerned about is memory allocation, disk usage, and transaction speed. Every shortcut taken in the source code, every data type or archiving scheme not supported, every function written, and every decision made by the programmer to achieve these goals has an influence on the interface, and therefore on the content the blogger produces.
Despite working at an indifferent distance, the developer heavily influences the blog – and by extension, blogger’s identity – by the decisions she makes when she codes the software. _____________________________________________________________________________ 204 Digital Humanities 2008 _____________________________________________________________________________ The way we structure language helps create meaning; likewise, the way in which it is stored also has meaning. To the programmer, language is nothing more than a set of bits and data types, which must be sorted into different containers. How the programmer deals with data affects how she creates the interface; if she has no data structure in place to handle a certain kind of information, she cannot request it from the user in the interface. The data structure is created through a process called normalization – breaking data down into its smallest logical parts. Developers normalize data in order to make it easier to use and reuse in a database: the title of your blog entry goes in one container; the body text goes into another, and so on. The structure of the data does not necessarily match the structure of its original context, however. Although a title and the body text are related to the same entry, no consideration is given by the developer as to whether one comes before the other, whether it should be displayed in a specific style, or if one has hierarchical importance over the other in the page. The data structure is dictated by the individual pieces of data themselves. The developer takes the data within each of these containers and stores it within a database. This may be a simple database, such as a CSV1 or Berkeley DB2 file, or it may reside within a more complex relational database such as MySQL or Microsoft SQL Server. Within the database exists a series of tables, and within each table resides a series of fields. A table holds a single record of data – a blog entry – and the table’s fields hold properties of that data, such as the title or entry date. Figure 1 illustrates an example of the above; a developer has created an Entries table with the fields EntryID3, Title, Date, BodyText, ExtendedText, Keywords, Category, and Post Status. When is possible, such as with the Category and Post Status fields, the developer will actually replace a string (alphanumeric) value with a numeric pointer to the same data within another table in the database. For example, an author may create a set of categories for her blog (such as “Personal Life,” “School,” et cetera, which are stored in a separate database table named Categories and associated with a unique ID (CategoryID). When an entry is marked with the Personal category, the software queries the database to see what the CategoryID of the Personal category is in the Categories table, and places that in the Category field in an entry’s record in the Entries table (see Figure 2). This sets up a series of relations within a database, and helps keep the database smaller; an integer takes far less space in the database than a string: 1 byte to store a single-digit integer, compared to 8 bytes for the string “Personal”; when you start working with hundreds of entries, this difference adds up quickly. 
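To make the normalization described above concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than the MySQL or Microsoft SQL Server systems mentioned above. The table and field names follow the abstract's Entries/Categories example; the sample data is invented, and the sketch also anticipates the single-row category update that the text goes on to describe.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two normalized tables: categories live in their own table, entries point to them.
cur.execute("CREATE TABLE Categories (CategoryID INTEGER PRIMARY KEY, Label TEXT)")
cur.execute("""CREATE TABLE Entries (
    EntryID INTEGER PRIMARY KEY,
    Title TEXT, Date TEXT, BodyText TEXT, ExtendedText TEXT,
    Keywords TEXT, Category INTEGER REFERENCES Categories(CategoryID),
    PostStatus INTEGER)""")

# The category string is stored once ...
cur.execute("INSERT INTO Categories (Label) VALUES ('Personal Life')")
personal_id = cur.lastrowid

# ... and each entry stores only the integer pointer to it.
cur.execute(
    "INSERT INTO Entries (Title, Date, BodyText, Category, PostStatus) "
    "VALUES (?, ?, ?, ?, ?)",
    ("My Entry", "2006-12-15", "This is the entry", personal_id, 1),
)

# Renaming the category touches one row; every entry that references it follows.
cur.execute("UPDATE Categories SET Label = ? WHERE CategoryID = ?",
            ("Stories from my life", personal_id))

# Reassembling an entry with its category name is a join over the pointer.
for row in cur.execute("""SELECT e.Title, c.Label FROM Entries e
                          JOIN Categories c ON e.Category = c.CategoryID"""):
    print(row)   # ('My Entry', 'Stories from my life')
```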
It is also easier to maintain; if you want to rename the “Personal” category to “Stories from the woeful events of my unexaggerated life” for example, you would only have to update the entry once in the Categories table; because it is referenced by its CategoryID in each entry, it will automatically be updated in all records that reference it. By abstracting often-used data such as a category into separate database tables, data can be reused within the database, which in turn keeps the size of the database smaller. If we know we will be referring to a single category in multiple entries, it makes sense to create a table of possible categories and then point to their unique identifier within each individual entry. Each field within a database table is configured to accept a specific format of information known as a data type. For example, the Date field in the Entries table above would be given a data type of DATETIME,4 while the Category field would be given a data type of INT (to specify an integer value). The body text of an entry would be placed in a binary data type known as the BLOB, since this is a type of data whose size is variable from record to record. Normalization conditions data to its purpose, and ensures that the developer always knows what kind of data to expect when he or she retrieves it later. It also has the benefit of loosely validating the data by rejecting invalid data types. If an attempt to store a piece of INT data in the Date field is made, it will trigger an error, which prevents the data from being misused within an application. The decisions made by the developer at this point, which involve configuring the tables and fields within the database, ultimately determine what will appear in the blog’s interface. If tables and fields do not exist in the database to support categorization of an entry, for example, it is unlikely to appear in the interface since there is no facility to store the information (and by extension, not prompt the blogger to categorize her thoughts). The interface gives the blogger certain affordances, something Robert St. Amant defines as “an ecological property of the relationship between an agent and the environment” (135).5 Amant describes affordance as a function we can see that is intuitive: “we can often tell how to interact with an object or environmental feature simply by looking at it, with little or no thought involved” (135, 136) – for example, we instinctively know not only what a chair is for, but the best way to make use of it. St. Amant further breaks down the affordance into four separate affordance-related concepts: relationship, action, perception, and mental construct (136-7). He goes on to discuss how to incorporate the idea of affordance into developing a user interface, focusing on action and relationship. The last of these concepts, affordance as a mental construct, is most relevant to our discussion. St. Amant writes “these mental affordances are the internal encodings of symbols denoting relationships, rather than the external situations that evoke the symbols” (137). In the authoring of the blog, the affordance of developing identity cannot be pinned on a single HTML control or text box; it is the process as a whole. LiveJournal and DiaryLand, for example, have the affordance of keeping a personal journal, or online diary. Blogger has the affordance of developing identity in a broader way by not necessarily focusing it on an autobiographical activity. 
The interface leads the blogger into a mode of writing through the affordances it provides. The infrastructure of the blog is its most fundamental paratextual element, creating a mirror for the blogger to peer into, but it is the blogger that makes the decision to look.
Notes
1 Comma-Separated Values. Each record of data consists of a single, unbroken line within a text file, which contains a series of values – each separated by a comma or other delimiter. An example of a CSV file for our entry would look like the following:
EntryID,Title,Date,BodyText,ExtendedText,Keywords,Category,PostStatus
1,My Entry,12/15/2006,This is the entry,,personal,Personal,Published
2,My Other Entry,12/20/2006,Look – another entry,And some extended text,personal,Personal,Published
2 Berkeley DB is a file-based database structure, which offers some basic relational mechanisms, but is not as robust or performant as other database systems.
3 For compatibility with multiple database systems, spaces are generally discouraged in both table and field names.
4 For the purposes of this section, I will use MySQL data types. Data types may vary slightly between different database applications.
5 As an interesting historical footnote, St. Amant wrote about the affordance in user interfaces around the time the first blog software packages were released in 1999.
References
Bolter, J.D. & Grusin, R. Remediation: Understanding New Media. Cambridge, MA: The MIT Press, 1999. Lawley, Liz. “Blog research issues” Many 2 Many: A group weblog on social software. June 24, 2004. Online. <http://many.corante.com/archives/2004/06/24/blog_research_issues.php> Accessed October 23, 2007. Koman, Richard. “Are Blogs the New Journalism” O’Reilly Digital Media. January 8, 2005. Online. Accessed April 8, 2007. <http://www.oreillynet.com/digitalmedia/blog/2005/01/are_blogs_the_new_journalism.html> McLuhan, Marshall. Understanding Media: The Extension of Man. Cambridge, Mass: MIT Press, 1994. Rak, Julie (2005, Winter). “The Digital Queer: Weblogs and Internet Identity.” Biography 28(1): 166-182. Scheidt, Lois Ann. “Quals reading – Outlining in Blood” Professional-Lurker: Comments by an Academic in Cyberspace. Online. Accessed May 23, 2005. <http://www.professional-lurker.com/archives/000410.html> Serfaty, Viviane. The Mirror and the Veil: An Overview of American Online Diaries and Blogs. Amsterdam & New York: Rodopi, 2004. St. Amant, Robert. “Planning and User Interface Affordances” Proceedings of the 4th international conference on Intelligent user interfaces. 1998.
A Modest proposal. Analysis of Specific Needs with Reference to Collation in Electronic Editions
Ron Van den Branden
ron.vandenbranden@kantl.be
Centrum voor Teksteditie en Bronnenstudie (KANTL), Belgium
Text collation is a vital aspect of textual editing; its results feature prominently in scholarly editions, either in printed or electronic form. The Centre for Scholarly Editing and Document Studies (CTB) of the Flemish Royal Academy for Dutch Language and Literature has a strong research interest in electronic textual editing. Much of the research for the electronic editions of De teleurgang van den Waterhoek (2000) and De trein der traagheid (forthcoming) concerned appropriate ways of visualising textual traditions. 
Starting from the functionality of collation results in electronic editions, as illustrated in the latter edition, this paper will investigate the needs for a dedicated automatic text collation tool for textcritical research purposes. The second section sets out with identifying different scenarios in the production model of electronic scholarly editions (based on Ott, 1992;Vanhoutte, 2006). Consequently, possible criteria for an automatic text collation tool in these production scenarios are identified, both on a collation-internal and generic software level. These criteria are analysed both from the broader perspective of automatic differencing algorithms and tools that have been developed for purposes of automatic version management in software development (Mouat, 2002; Komvoteas, 2003; Cobéna, 2003; Cobéna e.a., 2002 and [2004]; Peters, 2005; Trieloff, 2006), and from the specific perspective of text encoding and collation for academic purposes (Kegel and Van Elsacker, 2007; Robinson, 2007). Collation-internal criteria distinguish between three phases of the collation process: the automatic text comparison itself, representation of these comparisons and aspects of visualisation. Especially in the context of electronic scholarly editions, for which TEI XML is the de facto encoding standard, a degree of XMLawareness can be considered a minimal requirement for collation algorithms. A dedicated collation algorithm would be able to track changes on a structural and on word level. Moreover, since textual editing typically deals with complex text traditions, the ability to compare more than two versions of a text could be another requirement for a collation algorithm. Regarding representation of the collation results, an XML perspective is preferable as well.This allows for easier integration of the collation step with other steps in electronic editing. In a maximal scenario, a dedicated text collation tool would represent the collation results immediately in one or other TEI flavour for encoding textual variation. Visualisation of the collation results could be considered a criterion for a text collation tool, but seems less vital, however prominent it _____________________________________________________________________________ 206 Digital Humanities 2008 _____________________________________________________________________________ features in the broader context of developing an electronic edition. From a generic software perspective, a dedicated tool would be open source, free, multi-platform, and embeddable in other applications. These criteria are summarised in a minimal and a maximal scenario. The plea for a tool that meets the criteria for a minimal scenario is illustrated in the third section of the paper. The third section of the paper is a case study of another electronic edition in preparation at the CTB: the complete works of the 16th century Flemish poetess Anna Bijns, totalling around 300 poems that come in about 60 variant pairs or triplets. The specific circumstances of this project led to an investigation for collation solutions in parallel with the transcription and markup of the texts. After an evaluation of some interesting candidate tools for the collation step, the choice was made eventually to investigate how a generic XML-aware comparison tool could be put to use for multiple text collations in the context of textual editing. 
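To make the kind of output at stake here concrete (this is an illustrative sketch, not the CTB workflow or any of the tools surveyed above), the following compares two invented witness sentences word by word with Python's standard difflib module and wraps the divergences in a TEI-style parallel-segmentation apparatus, with one <rdg> per witness inside each <app>. A dedicated tool would additionally need to be XML-aware, handle more than two witnesses, and track variation at the structural as well as the word level, as argued above.

```python
from difflib import SequenceMatcher

def collate(witness_a, witness_b, siglum_a="A", siglum_b="B"):
    """Word-level collation of two witnesses into a TEI-flavoured string."""
    a, b = witness_a.split(), witness_b.split()
    out = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
        if op == "equal":                      # shared text stays as plain text
            out.append(" ".join(a[i1:i2]))
        else:                                  # divergence: one reading per witness
            out.append('<app><rdg wit="#%s">%s</rdg><rdg wit="#%s">%s</rdg></app>'
                       % (siglum_a, " ".join(a[i1:i2]),
                          siglum_b, " ".join(b[j1:j2])))
    return " ".join(out)

print(collate("the letter was sent to her brother in May",
              "the letter was posted to her brother early in May"))
# the letter was <app><rdg wit="#A">sent</rdg><rdg wit="#B">posted</rdg></app>
# to her brother <app><rdg wit="#A"></rdg><rdg wit="#B">early</rdg></app> in May
```

A full apparatus would also register the base reading with <lem>, declare witnesses in a <listWit>, and preserve markup rather than flattening the text into word tokens; the sketch only illustrates why an XML-native representation of the comparison step makes the subsequent encoding and visualisation stages easier to chain together.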
Besides formal procedures and a set of XSLT stylesheets for different processing stages of the collation results, this development process provided an interesting insight in the specific nature of ‘textual’ text collation, and the role of the editor. The description of the experimental approach that was taken will illustrate the criteria sketched out in the previous section, indicate what is possible already with quite generic tools, and point out the strengths and weaknesses of this approach. Finally, the findings are summarised and a plea is made for a text collation tool that fills the current lacunae with regards to the current tools’ capacities for the distinct steps of comparison, representation and visualisation. References Cobéna, Grégory (2003). Change Management of semistructured data on the Web. Doctoral dissertation, Ecole Doctorale de l’Ecole Polytechnique. <ftp://ftp.inria.fr/INRIA/ publication/Theses/TU-0789.pdf> Cobéna, Grégory, Talel Abdessalem,Yassine Hinnach (2002). A comparative study for XML change detection.Verso report number 221. France: Institut National de Recherche en Informatique et en Automatique, July 2002. <ftp://ftp.inria. fr/INRIA/Projects/verso/VersoReport-221.pdf> Kegel, Peter, Bert Van Elsacker (2007). ‘“A collection, an enormous accumulation of movements and ideas”. Research documentation for the digital edition of the Volledige Werken (Complete Works) of Willem Frederik Hermans’. In: Georg Braungart, Peter Gendolla, Fotis Jannidis (eds.), Jahrbuch für Computerphilologie 8 (2006). Paderborn: mentis Verlag. p. 6380. <http://computerphilologie.tu-darmstadt.de/jg06/kegelel. html> Komvoteas, Kyriakos (2003). XML Diff and Patch Tool. Master’s thesis. Edinburgh: Heriot Watt University. <http://treepatch. sourceforge.net/report.pdf> Mouat, Adrian (2002). XML Diff and Patch Utilities. Master’s thesis. Edinburgh: Heriot Watt University, School of Mathematical and Computer Sciences. <http://prdownloads. sourceforge.net/diffxml/dissertation.ps?download> Ott, W. (1992). ‘Computers and Textual Editing.’ In Christopher S. Butler (ed.), Computers and Written Texts. Oxford and Cambridge, USA: Blackwell, p. 205-226. Peters, Luuk (2005). Change Detection in XML Trees: a Survey. Twente Student Conference on IT, Enschede, Copyright University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science. <http:// referaat.cs.utwente.nl/documents/2005_03_B-DATA_AND_ APPLICATION_INTEGRATION/2005_03_B_Peters,L. J.-Change_detection_in_XML_trees_a_survey.pdf> Robinson, Peter (2007). How CollateXML should work. Anastasia and Collate Blog, February 5, 2007. <http://www. sd-editions.com/blog/?p=12> Trieloff, Lars (2006). Design and Implementation of a Version Management System for XML documents. Master’s thesis. Potsdam: Hasso-Plattner-Institut für Softwaresystemtechnik. <http://www.goshaky.com/publications/master-thesis/XMLVersion-Management.pdf> Vanhoutte, Edward (2006). Prose Fiction and Modern Manuscripts: Limitations and Possibilities of Text-Encoding for Electronic Editions. In Lou Burnard, Katherine O’Brien O’Keeffe, John Unsworth (eds.), Electronic Textual Editing. New York: Modern Language Association of America, p. 161-180. <http://www.tei-c.org/Activities/ETE/Preview/vanhoutte.xml> Cobéna, Grégory, Talel Abdessalem,Yassine Hinnach ([2004]). A comparative study of XML diff tools. Rocquencourt, [Unpublished update to Cobéna e.a. (2002), seemingly a report for the Gemo project at INRIA.] <http://www. 
deltaxml.com/dxml/90/version/default/part/AttachmentData/data/is2004.pdf>
(Re)Writing the History of Humanities Computing
Edward Vanhoutte
edward.vanhoutte@kantl.be
Centrum voor Teksteditie en Bronnenstudie (KANTL), Belgium
Introduction
On several occasions, Willard McCarty has argued that if humanities computing wants to be fully of the humanities, it needs to become historically self-aware (McCarty, 2004, p. 161). Contrary to what the frequently repeated myth of humanities computing claims (de Tollenaere, 1976; Hockey, 1980, p. 15; 1998, p. 521; 2000, p. 5; 2004, p. 4), humanities computing, like many other interdisciplinary experiments, has no very well-known beginning.
The History of Humanities Computing
Researching and writing the history of humanities computing is no less, but also no more, problematic than researching and writing the history of computing, technology in general, or the history of recent thought. Beverley Southgate considers the history of thought as an all-embracing subject matter which can include the history of philosophy, of science, of religious, political, economic, or aesthetic ideas, ‘and indeed the history of anything at all that has ever emerged from the human intellect.’ (Southgate, 2003, p. 243) Attached to the all-embracing nature of what she then calls ‘intellectual history/history of ideas’ is the defiance of disciplinary constraints that it fosters in its practitioners. The history of humanities computing, for instance, must consider the history of the conventional schools of theory and practice of humanities disciplines, the general histories of computing and technology, the history of relevant fields in computing science and engineering, and the history of the application of computational techniques to the humanities.
A Common History
Several authors contradict each other when it comes to naming the founding father of digital humanities. At least four candidates compete with each other for that honour. Depending on whether one bases one’s research on ideas transmitted orally or in writing, on academic publications expressing and documenting these ideas, or on academic writings publishing results, Father Roberto Busa s.j., the Reverend John W. Ellison, Andrew D. Booth, and Warren Weaver are the competitors. This is not surprising since, when the use of automated digital techniques was considered to process data in the humanities, two lines of research were activated: Machine Translation (MT) and Lexical Text Analysis (LTA). Booth and Weaver belonged to the MT line of research and were not humanists by training, whereas Busa and Ellison were LTA scholars and philologists. As Antonio Zampolli pointed out, the fact that MT was promoted mainly in ‘hard science’ departments and LTA was mainly developed in humanities departments did not enhance frequent contacts between them. (Zampolli, 1989, p. 182) Although real scholarly collaboration was scarce, contacts certainly existed between the two on the administrative level. 
Since, as Booth (1967, p.VIII) observes, pure linguistic problems such as problems of linguistic chronology and disputed authorship, have profited from work on MT, the transition from MT to Computational Linguistics, dictated by the infamous US National Academy of Sciences ALPAC report (ALPAC, 1966), was a logical, though highly criticized, step. Moreover, the early writings on MT mention the essential use of concordances, frequency lists, and lemmatization – according to Zampolli (1989) typical products of LTA – in translation methods, which shows that both Computational Linguistics and Humanities Computing share a common history.This early history of both has never been addressed together, but it seems necessary for a good understanding of what it is that textual studies do with the computer. The History of Recent Things The history of recent things, however, poses some new and unique challenges for which, in Willard McCarty’s vision, a different conception of historiography is needed. As an illustration for his point, McCarty focuses on the different qualities of imagination that characterize the research and writing of the classical historian on the one hand and the historian of recent things on the other and situates them in a temporal and a spatial dimension respectively. A successful classical historian must manage the skill to move into the mental world of the long dead, whereas the historian of the recent past, like the anthropologist, must master the movement away ‘from the mental world we share with our subjects while remaining engaged with their work’ (McCarty, 2004, p. 163). In a rash moment, this double awareness of the historian of the recent is fair game to triumphalism, the historian’s worst fiend. The temptation to search for historical and quantitative, rather than qualitative, ‘evidence’ to prove the importance of their own field or discipline is innate in somewhat ambitious insiders attempting at writing the history of their own academic field. Also since, as Alun Munslow has reiterated, it is historians rather than the past that generates (writes, composes?) history (Munslow, 1999), McCarty’s astute but non-exhaustive catalogue of the new and unique challenges presents itself as a warning for the historiographer of the new, and I endorse this list fully: ‘volume, variety, and complexity of the evidence, and difficulty of access to it; biases and partisanship of living informants; unreliability of memory; distortions from the historian’s personal engagement with his or her informants—the ‘Heisenberg effect’, as it is popularly known; the ‘presentism’ of science and its perceived need for legitimation through an official, triumphalist account; and so on.’ (McCarty, 2004, p. 163) _____________________________________________________________________________ 208 Digital Humanities 2008 _____________________________________________________________________________ All history is inevitably a history for, and can never be ideologically neutral, as Beverly Southgate has recently emphasized in a review of Marc Ferro’s The Use and Abuse of History, or, How the Past is Taught. (Southgate, 2005) Therefore the question can never be ‘Comment on raconte l’histoire aux enfants à travers le monde entier’ as the original title of Ferro’s book reads. Greg Dening would say that writing history is a performance in which historians should be ‘focused on the theatre of what they do’. (Dening, 1996, p. 
30) According to McCarty, the different conception of historiography, could profit from acknowledging ethnographic theory and ethnography as a contributory discipline since it entails a poetic of ‘dilatation beyond the textable past and beyond the ‘scientific’ reduction of evidence in a correct and singular account.’ (McCarty, 2004, p. 174) Prolegomena The current lack of theoretical framework that can define and study the history of humanities computing echoes what Michael Mahoney wrote with respect to the history of computing: ‘The major problem is that we have lots of answers but very few questions, lots of stories but no history, lots of things to do but no sense of how to do them or in what order. Simply put, we don’t yet know what the history of computing is really about.’ (Mahoney, 1993) Three possible statements can be deducted from this observation: • We don’t know what history is about; • We don’t know what humanities computing is about; • We don’t know what the history of humanities computing is about. This paper aims at addressing these three basic questions by sketching out prolegomena to the history of humanities computing. References ALPAC (1966) Languages and machines: computers in translation and linguistics. A report by the Automatic Language Processing Advisory Committee, Division of Behavioral Sciences, National Academy of Sciences, National Research Council. Washington, D.C.: National Academy of Sciences, National Research Council (Publication 1416). <http://www.nap.edu/books/ARC000005/html/> De Tollenaere, Felicien (1976). Word-Indices and Word-Lists to the Gothic Bible and Minor Fragments. Leiden: E. J. Brill. Dening, Greg (1996). Performances. Chicago: University of Chicago Press. Hockey, Susan (1980). A Guide to Computer Applications in the Humanities. London: Duckworth. Hockey, Susan (1998). An Agenda for Electronic Text Technology in the Humanities. Classical World, 91: 521-42. Hockey, Susan (2000). Electronic Texts in the Humanities. Principles and Practice. Oxford: Oxford University Press. Hockey, Susan (2004). The History of Humanities Computing. In Schreibman, Susan, Siemens, Ray, and Unsworth, John (eds.), A Companion to Digital Humanities. Malden, MA/Oxford/ Carlton,Victoria: Blackwell Publishing, p. 3-19. Southgate, Beverley (2003). Intellectual history/history of ideas. In Berger, Stefan, Feldner, Heiko, and Passmore, Kevin (eds.), Writing History.Theory & Practice. London: Arnold, p. 243-260. Southgate, Beverley (2005). Review of Marc Ferro, The Use and Abuse of History, or, How the Past is Taught. Reviews in History, 2005, 441. <http://www.history.ac.uk/reviews/reapp/southgate.html> Mahoney, Michael S. (1993). Issues in the History of Computing. Paper prepared for the Forum on History of Computing at the ACM/SIGPLAN Second History of Programming Languages Conference, Cambridge, MA, 20-23 April 1993. <http://www.princeton.edu/~mike/articles/issues/issuesfr. htm> McCarty, Willard (2004). As It Almost Was: Historiography of Recent Things. Literary and Linguistic Computing, 19/2: 161-180. Munslow, Alun (1999). The Postmodern in History: A Response to Professor O’Brien. Discourse on postmodernism and history. Institute of Historical Research. <http://www.history.ac.uk/discourse/alun.html> Zampolli, Antonio (1989). Introduction to the Special Section on Machine Translation. Literary and Linguistic Computing, 4/3: 182-184. Booth, A.D. (1967). Introduction. In Booth, A.D. Machine Translation. Amsterdam: North-Holland Publishing Company, p.VI-IX. 
_____________________________________________________________________________ 209 Digital Humanities 2008 _____________________________________________________________________________ Knowledge-Based Information Systems in Research of Regional History Aleksey Varfolomeyev avarf@psu.karelia.ru Petrozavodsk State University, Russian Federation Henrihs Soms henrihs.soms@du.lv Daugavpils University, Latvia Aleksandrs Ivanovs aleksandrs.ivanovs@du.lv Daugavpils University, Latvia Representation of data on regional history by means of Web sites is practised in many countries, because such Web sites should perform – and more often than not they really perform – three important functions. First, they further preservation of historical information related to a certain region. Second, Web sites on regional history promote and publicize historical and cultural heritage of a region, hence they can serve as a specific ‘visiting-card’ of a region. Third, they facilitate progress in regional historical studies by means of involving researchers in Web-community and providing them with primary data and, at the same time, with appropriate research tools that can be used for data retrieving and processing. However, implementation of the above-mentioned functions calls for different information technologies. For instance, preservation of historical information requires structuring of information related to a certain region, therefore databases technologies should be used to perform this function. In order to draft and update a ‘visiting-card’ of a region, a Web site can be designed on the basis of Content Management System (CMS). Meanwhile, coordination of research activities within the frameworks of a certain Web-community calls for implementation of modern technologies in order to create knowledge-based information systems. In this paper, a concept of a knowledge-based information system for regional historical studies is framed and a pattern of data representation, aggregation, and structuring is discussed. This pattern can be described as a specialized semantic network hereafter defined as Historical Semantic Network (HSN). Up to now, the most popular pattern of representation of regional historical information is a traditional Web site that consists of static HTML-pages. A typical example is the Web site Latgales Dati (created in 1994, see http://dau.lv/ld), which provides information about the history and cultural heritage of Eastern Latvia – Latgale (Soms 1995; Soms and Ivanovs 2002). On the Web site Latgales Dati, historical data are stored and processed according to the above-mentioned traditional pattern. The shortcomings of this pattern are quite obvious: it is impossible to revise the mode of processing and representation of data on the Web site; it is rather difficult to feed in additional information, to retrieve necessary data, and to share information. In the HEML project, a special language of representation and linkage of information on historical events has been developed (http://www.heml.org). Advantages of this language are generally known: a modern approach to processing and representation of historical information on the basis of XML, a developed system of tools of visualization of information, etc. However, the processing of information is subjected to descriptions of historical events. It can be regarded as a drawback of this language, since such descriptions do not form a solid basis for historical data representation in databases. 
Actually, any description of a historical event is an interpretation of narratives and documentary records. For this reason, databases should embrace diverse historical objects, namely, documents and narratives, persons, historical monuments and buildings, institutions, geographical sites, and so on, and so forth.These objects can be described using different attributes simultaneously, which specify characteristic features of the objects. Sometimes, objects can be labelled with definite, unique attributes only. The detailed, exhaustive information about the objects can be represented by means of links between objects, thus creating a specific semantic network. Within this network, paramount importance is attached to special temporal objects – the chronological ‘markers’. Connecting links between a number of historical objects, including special temporal objects, will form rather a solid basis for historical discourse upon historical events, processes, and phenomena. The first characteristic feature of the semantic network, which is being created on the basis of the Web site Latgales Dati, is the principle role assigned to such historical objects as documentary records and narratives.Thus, the connecting links between the objects are constituted (actually, reconstructed) in accordance with the evidences of historical sources. In order to provide facilities for verification of the links, the semantic network should contain full texts of historical records as well as scanned raster images of them. On the one hand, texts and images can be directly connected with the objects of the semantic network by means of SVG technology; on the other hand, this connection can be established by means of mark-up schemes used in XML technology (Ivanovs and Varfolomeyev 2005). The second characteristic feature of HSN is fuzziness of almost all aspects of information in the semantic network (Klir and Yuan 1995).The source of this fuzziness is uncertainty of expert evaluations and interpretations of historical data; moreover, evidences of historical records can be either fragmentary and contradictory, or doubtful and uncertain.Therefore, the level of validity of information is ‘measured’ by variables, which can be expressed by ordinary numbers (0–1). However, it seems that they should be expressed by vector quantities as pairs (true, false); the components true and false can assume values from 0 to 1. In this case, the pair (1,1) refers to very contradictory and unreliable information, meanwhile the pair (0,0) means _____________________________________________________________________________ 210 Digital Humanities 2008 _____________________________________________________________________________ the absence of information. This approach was used in construction of four-valued logic (Belnap 1977). The fuzziness values defined for some objects can be propagated within a semantic network by means of logical reasoning (Hähnle 1993). This ‘measurement’ of validity of data is performed by users of HSN; the results of ‘measurement’ should be changeable and complementary, thus reflecting cooperation of researchers within the frameworks of Web-community. Creating a semantic network, a number of principal operations should be performed: reconstruction, linkage, and representation. Reconstruction is either manual or automatic generation of interrelations between objects that are not directly reflected in historical records. 
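As a loose illustration of the data model implied here (my own sketch, not code from Latgales Dati or the Petrozavodsk application), the fragment below shows historical objects joined by hyper-edge links that carry the two-component validity pairs described above, with (1, 1) marking contradictory evidence and (0, 0) the absence of information; the Belnap-style propagation of these values through the network is not shown, and the threshold helper is purely illustrative. The three operations just introduced, reconstruction, linkage, and representation, would then operate over structures of this kind.

```python
from dataclasses import dataclass

@dataclass
class HistoricalObject:
    kind: str                     # "document", "person", "place", "temporal marker", ...
    label: str

@dataclass
class Link:
    relation: str                 # e.g. "mentions", "dated-to"
    members: list                 # a hyper-edge may join any number of objects
    validity: tuple = (0.0, 0.0)  # (evidence for true, evidence for false), each in 0..1

    def is_contradictory(self, threshold=0.7):
        true_part, false_part = self.validity
        return true_part >= threshold and false_part >= threshold

charter = HistoricalObject("document", "sample charter")
castle = HistoricalObject("place", "sample castle")
year = HistoricalObject("temporal marker", "1275")

dating = Link("dated-to", [charter, castle, year], validity=(0.9, 0.2))
print(dating.is_contradictory())  # False: well supported, little counter-evidence
```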
In this stage, new historical objects can emerge; these objects are reconstructed in accordance with the system of their mutual relations. Linkage can be defined as detection of similar (in fuzzy sense) objects, which within a semantic network (or within a number of interrelated networks) form a definite unit. Representation is a number of operations of retrieving of data from the semantic network including representation of information by means of tables, graphs, timeline, geographic visualization, etc. HSN as a basic pattern of representation of historical information in Internet provides researchers with proper tools, which can easily transform data retrieved from such Web sites. However, some problems should be solved in order to ‘substitute’ the traditional Web site Latgales Dati for HSN. There are different approaches to representation of semantic networks in Internet. A generally accepted approach is division of network links into subjects, predicates, and objects followed by their representation by means of RDF. However, the RDF pattern (actually, directed graph in terms of mathematics) is not adequate to the hyper-graph pattern accepted in HSN, since in HSN one and the same link can connect different objects simultaneously. In order to represent hyper-graphs, different semantic networks – such as WordNet (Fellbaum 1998), MultiNet (Helbig 2006), topic maps (Dong and Li 2004) or even simple Wiki-texts interconnected by so-called ‘categories’ – should be used. Unfortunately, the above-mentioned semantic networks do not provide proper tools to perform the tasks set by HSN. For this purpose, the experience acquired by designers of universal knowledge processing systems (e.g. SNePS, see Shapiro 2007, or Cyc, see Panton 2006) should be taken into consideration. One more important problem is linkage of different HSN, including Latgales Dati. Such linkage can be conducive to cooperation of researchers, since it ensures remote access to historical data and exchange of information and results of research work. It seems that Web-services technology can carry out this task. In this case, HSN should be registered in a certain UDDI-server. A user’s query activates interaction of a Web-server with numerous related Web-servers, which are supplied with HSN-module. As a result, a semantic network is being generated on the basis of fragments of different networks relevant to the initial query. The principles of creation of HSN substantiated above are used in designing a new version of the knowledge-based information system Latgales Dati. As software environment a specialized Web application created at Petrozavodsk State University is applied (http://mf.karelia.ru/hsn). References Belnap, N. A Useful Four-Valued Logic. In J.M. Dunn and G. Epstein, ed. Modern Uses of Multiple-Valued Logic. Dordreht, 1977, pp. 8-37. Fellbaum, C., ed. WordNet: An Electronic Lexical Database. Cambridge, Mass., 1998. Hähnle, R. Automated Deduction in Multiple-Valued Logics. Oxford, 1993. Helbig, H. Knowledge Representation and the Semantics of Natural Language. Berlin; Heidelberg; New York, 2006. Ivanovs, A.,Varfolomeyev, A. Editing and Exploratory Analysis of Medieval Documents by Means of XML Technologies. In Humanities, Computers and Cultural Heritage. Amsterdam, 2005, pp. 155-60. Klir, G.R.,Yuan, B. Fuzzy Sets and Fuzzy Logic:Theory and Applications. Englewood Cliffs, NJ, 1995. Dong,Y., Li, M. HyO-XTM: A Set of Hyper-Graph Operations on XML Topic Map toward Knowledge Management. 
Future Generation Computer Systems, 20, 1 (January 2004): 81-100. Panton, K. et al. Common Sense Reasoning – From Cyc to Intelligent Assistant. In Yang Cai and Julio Abascal, eds. Ambient Intelligence in Everyday Life, LNAI 3864. Berlin; Heidelberg; New York, 2006, pp. 1-31. Shapiro, S.C. et al. SNePS 2.7 User’s Manual. Buffalo, NY, 2007 (http://www.cse.buffalo.edu/sneps/Manuals/manual27.pdf). Soms, H. Systematisierung der Kulturgeschichtlichen Materialien Latgales mit Hilfe des Computers. In The First Conference on Baltic Studies in Europe: History. Riga, 1995, pp. 34-5. Soms, H., Ivanovs, A. Historical Peculiarities of Eastern Latvia (Latgale): Their Origin and Investigation. Humanities and Social Sciences. Latvia, 2002, 3: 5-21. _____________________________________________________________________________ 211 Digital Humanities 2008 _____________________________________________________________________________ Using Wmatrix to investigate the narrators and characters of Julian Barnes’ Talking It Over Brian David Walker b.d.walker1@lancaster.ac.uk Lancaster University, UK Introduction The focus of this paper is on Wmatrix (Rayson 2008), and how the output from this relatively new corpus tool is proving useful in connecting language patterns that occur in Talking It Over - a novel by Julian Barnes - with impressions of narrators and characters. My PhD research is concerned with: 1) the role of the narrator and author in characterisation in prose fiction, in particular how the narrator intervenes and guides our perception of character – something that, in my opinion, has not been satisfactorily dealt with in existing literature; and 2) how corpus tools can be usefully employed in this type of investigation. In this paper I will show that Wmatrix helps to highlight parts of the novel that are important insofar as theme and/or narratorial style and/or characterisation. I will go on to discuss difficulties I encountered during my investigation, and possible developments to Wmatrix that could be beneficial to future researchers. Of course, computer-assisted approaches to the critical analyses of novels are no longer completely new, and corpus stylistics (as it is sometimes known) is a steadily growing area of interest. Indeed, corpus approaches have already been used to investigate character in prose and drama. McKenna and Antonia (1996), for example, compare the internal monologues of three characters (Molly, Stephen and Leopold) in Joyce’s Ulysses, testing the significance on the most frequent words in wordlists for each character, and showing that common word usage can form “[…] distinct idioms that characterize the interior monologues […]” (McKenna and Antonia 1996:65). Also, more recently, Culpeper (2002) uses the ‘KeyWord’ facility in WordSmith Tools (Scott 1996) to demonstrate how the analysis of characters’ keywords in Shakespeare’s Romeo and Juliet can provide data to help establish lexical and grammatical patterns for a number of the main characters in the play. While my research adopts similar approaches to the two examples above, it takes a small yet useful step beyond wordlevel analysis, focussing instead on semantic analysis and keyconcepts. This is possible because Wmatrix can analyse a text at the semantic level (see below). Wmatrix Wmatrix (Rayson 2008) is a web-based environment that contains a number of tools for the analyses of texts. 
Plain-text versions of texts can be uploaded to the Wmatrix web-server where they are automatically processed in three different ways:
1) Word level – frequency lists of all the words in a text are compiled and can be presented either in order of frequency or alphabetically;
2) Grammatical level – using the Constituent Likelihood Automatic Word-tagging System (CLAWS – see Garside 1987, 1996; Leech, Garside and Bryant 1994; and Garside and Smith 1997) developed by the University Centre for Computer Corpus Research on Language (UCREL[i]) at Lancaster University, every word in a text is assigned a tag denoting the part-of-speech (POS) or grammatical category to which the word belongs. The words are then presented in list form either alphabetically (by POS tag) or by frequency (the most frequent POS tags at the top of the list);
3) Semantic level – every word is semantically tagged using the UCREL semantic analysis system (USAS[ii]) and then listed in order of frequency or alphabetically by semantic tag.
USAS groups words together that are conceptually related. It assigns tags to each word using a hierarchical framework of categorization, which was originally based on MacArthur’s (1981) Longman Lexicon of Contemporary English. Tags consist of an uppercase letter, which indicates the general discourse field, of which there are 21 – see Table 1.
Table 1. The 21 top level categories of the USAS tagset
A – general & abstract terms
B – the body & the individual
C – arts & crafts
E – emotion
F – food & farming
G – government & public
H – architecture, housing & the home
I – money & commerce (in industry)
K – entertainment
L – life & living things
M – movement, location, travel, transport
N – numbers & measurement
O – substances, materials, objects, equipment
P – education
Q – language & communication
S – social actions, states & processes
T – time
W – world & environment
X – psychological actions, states & processes
Y – science & technology
Z – names & grammar
The uppercase letter is followed by a digit, which indicates the first sub-division of that field. The tag may also include a decimal point followed by a further digit to indicate finer subdivision of the field. For example, the major semantic field of GOVERNMENT AND THE PUBLIC DOMAIN is designated by the letter G. This major field has three subdivisions: GOVERNMENT, POLITICS AND ELECTIONS – tagged G1; CRIME, LAW AND ORDER – tagged G2; and WARFARE, DEFENCE AND THE ARMY – tagged G3. The first subdivision (G1 – GOVERNMENT, POLITICS AND ELECTIONS) is further divided into: GOVERNMENT ETC. – which has the tag G1.1; and POLITICS – which is tagged G1.2. From these initial lists, further analyses and comparisons are possible within the Wmatrix environment. However, the focus of this paper will be on my analysis of the USAS (semantic) output from Wmatrix, as this output proves to be the most interesting and the most useful interpretatively.
Data
Talking It Over, by Julian Barnes, is a story with a fairly familiar theme – a love triangle – that is told in a fairly unusual way – there are nine first person narrators. This offers interesting data with regard to the interaction between character and narrator, as the story is told from a number of different perspectives, with the different narrators often commenting on the same events from different angles. Of the nine narrators Oliver, Stuart and Gillian are the three main ones, not only because they are the three people involved in the love triangle, but also because they say substantially more than any of the other narrators in the novel. The six other narrators are people whose involvement in the story is more peripheral, but come into contact with one or more of the main narrators. Within the novel each contribution by different narrators is signalled by the narrator’s name appearing in bold-type at the beginning of the narration. Some chapters consist of just one contribution from one narrator while others consist of several “narrations”. The three main narrators make contributions of varying lengths throughout the novel.
Approach
The exploration of narrators described in this paper adopts a similar approach to that used by Culpeper (2002) in his analysis of characters in Romeo and Juliet. That is, I compare the words of one narrator with the words of all the other narrators combined, using Wmatrix in its capacity as a tool for text comparison. This comparison produced a number of lists ranked by statistical significance. The measure of statistical significance used by Wmatrix is log-likelihood (LL). The process of comparison involved, to some extent, disassembling the novel, in order to extract, into separate text files, the narrated words of each of the three main narrators. The disassembly process had to take into account the fact that within the different narrations there were the words of other people and characters. That is to say, the narrations contained speech, writing and thought presentation (SW&TP). This needed to be accounted for in the analysis in order to be as precise as possible about descriptions and impressions of narrators based on the Wmatrix output. The best way to achieve this was to produce a tagged version of the novel, from which relevant portions could be extracted more exactly using computer tools. These extracted portions could then be analysed using Wmatrix. My paper will discuss the tagging process, the Wmatrix output and the subsequent analysis, and show how key semantic concepts highlighted for each of the three main characters draw attention to important themes within the narrations, and to styles of narration which can then be linked to character traits. Included in my discussion will be issues concerning Wmatrix’s lexicon and the type of automated semantic analysis Wmatrix carries out.
Conclusions
Interpretative conclusions
The list of semantic groups produced by Wmatrix for each of the three main narrators shows a number of differences in narrator characteristics and narratorial styles. Stuart’s list contains semantic groups that relate to his job. An investigation of highly significant categories in the list showed that Stuart’s attitudes relating to particular themes or concepts change during the novel. For example, Stuart’s relationship with money alters, which is reflected in a change of attitude toward love. Stuart also becomes less worried about disappointing people and more concerned about not being disappointed himself. An investigation of items in Gillian’s list of semantic categories identified a more conversational style to her narration when compared to the rest of the narrators, as well as a determination to give an account of the story that was accurate and complete. The top item in Oliver’s list of categories contains the words that Wmatrix could not match. 
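The one-against-the-rest comparison described under "Approach" above is ranked by the log-likelihood statistic. The sketch below shows that calculation as it is standardly used for corpus comparison (cf. Rayson 2003); the frequencies are invented, and Wmatrix's internal implementation may differ in detail.

```python
from math import log

def log_likelihood(freq_a, freq_b, size_a, size_b):
    """Keyness of one item (word or semantic tag) in corpus A against corpus B."""
    expected_a = size_a * (freq_a + freq_b) / (size_a + size_b)
    expected_b = size_b * (freq_a + freq_b) / (size_a + size_b)
    ll = 0.0
    if freq_a:
        ll += freq_a * log(freq_a / expected_a)
    if freq_b:
        ll += freq_b * log(freq_b / expected_b)
    return 2 * ll

# Invented figures: a semantic tag occurring 120 times in 30,000 tokens of one
# narrator against 150 times in the 70,000 tokens of the other narrators combined.
print(round(log_likelihood(120, 150, 30_000, 70_000), 1))  # about 25.0, far above
# the 6.63 critical value conventionally used for significance at p < 0.01
```

Sorting all items by this score is what produces the ranked lists discussed in the conclusions, including the large set of words in Oliver's narration that Wmatrix could not match at all.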
While on one hand this could be seen as a short-coming of Wmatrix, in the case of this study, the result gave a clear indication of the number of unusual, foreign and hyphenated words Oliver uses, and showed that he has very broad vocabulary as well as knowledge of a wide variety of topics. The list of unmatched words also highlighted Oliver’s creativity with language, and that creativity is very much part of his style. He uses unusual, technical and poetic words in place of more commonplace synonyms and this could be seen as evidence towards Oliver being showy and flamboyant. Wmatrix development The results from this study, in particular those relating to Oliver’s narration and the failure of Wmatrix to successfully categorise many of the words used in it, raise issues regarding the tool’s lexicon and the semantic categories it uses. These issues will be discussed as potential avenues for the development of this useful corpus tool. For instance, the Wmatrix-lexicon is currently progressively updated and expanded, meaning that, in practice, there is no one static point from which comparisons of texts can be made. While there is a case for a lexicon that is as comprehensive as possible, the present way of managing the lexicon can cause difficulties for research projects that continue over an extended period of time. However, creating a ‘standard’ fixed lexicon is not without difficulties and raises questions about what counts as ‘standard’. Even though Wmatrix allows users to define their own lexicon, a possible way forward might be to have multiple _____________________________________________________________________________ 213 Digital Humanities 2008 _____________________________________________________________________________ lexicons, such as a scientific lexicon or a learner-English lexicon, which researchers could select depending on their research needs. Notes i For further information about UCREL see http://www.comp.lancs. ac.uk/ucrel/ ii For further information about USAS, see http://www.comp.lancs. ac.uk/ucrel/usas/ References Culpeper, J. (2002) “Computers, language and characterisation: An Analysis of six characters in Romeo and Juliet.” In: U. Melander-Marttala, C. Ostman and Merja Kytö (eds.), Conversation in Life and in Literature: Papers from the ASLA Symposium, Association Suedoise de Linguistique Appliquee (ASLA), 15. Universitetstryckeriet: Uppsala, pp.11-30. Garside, R. (1987). The CLAWS Word-tagging System. In: R. Garside, G. Leech and G. Sampson (eds), The Computational Analysis of English: A Corpus-based Approach. London: Longman. Garside, R. (1996). The robust tagging of unrestricted text: the BNC experience. In J. Thomas and M. Short (eds) Using corpora for language research: Studies in the Honour of Geoffrey Leech. Longman, London, pp 167-180. Garside, R., and Smith, N. (1997) A hybrid grammatical tagger: CLAWS4, in Garside, R., Leech, G., and McEnery, A. (eds.) Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman, London, pp. 102-121. Leech, G., Garside, R., and Bryant, M. (1994). CLAWS4: The tagging of the British National Corpus. In Proceedings of the 15th International Conference on Computational Linguistics (COLING 94). Kyoto, Japan, pp622-628. McKenna, W. and Antonia, A. (1996) “ ‘A Few Simple Words’ of Interior Monologue in Ulysses: Reconfiguring the Evidence”. In Literary and Linguistic Computing 11(2) pp55-66 Rayson, P. (2008) Wmatrix: a web-based corpus processing environment, Computing Department, Lancaster University. 
http://www.comp.lancs.ac.uk/ucrel/wmatrix/ Rayson, P. (2003). Matrix: A statistical method and software tool for linguistic analysis through corpus comparison. Ph.D. thesis, Lancaster University. Scott, M. (1996) Wordsmith. Oxford:Oxford University Press The Chymistry of Isaac Newton and the Chymical Foundations of Digital Humanities John A. Walsh jawalsh@indiana.edu Indiana University, USA My paper will examine both a specific digital humanities project, The Chymistry of Isaac Newton <http://www.chymistry.org/>, and reflect more broadly on the field of digital humanities, by suggesting useful parallels between the disciplines of alchemy and digital humanities. The Chymistry of Isaac Newton project is an effort to digitize and edit the alchemical writings of Newton and to develop digital scholarly tools (reference, annotation, and visualization) for interacting with the collection. Newton’s “chymistry” has recently become a topic of widespread public interest.[1] It is no longer a secret that the pre-eminent scientist of the seventeenth century spent some thirty years working in his alchemical laboratory, or that he left a manuscript Nachlass of about 1,000,000 words devoted to alchemy. Newton’s long involvement in chymistry figures prominently in NOVA’s 2005 “Newton’s Dark Secrets,” and it occupies a major portion of the documentary’s website at <http://www.pbs.org/wgbh/nova/newton/>. Even more attention was devoted to Newton’s alchemy in the 2003 BBC production “Newton: The Dark Heretic.” Newton’s chymistry also is featured in recent popularizing studies such as Gleick’s 2003 Isaac Newton and White’s 1997 Isaac Newton: the Last Sorcerer. Despite the fascination that Newton’s chymistry holds for the public, the subject has not received a corresponding degree of scrutiny from historians since the untimely passing of B.J.T. Dobbs in 1994 and Richard Westfall in 1996. Dobbs had made the issue of Newton’s alchemy a cause célèbre in her influential historical monograph, The Foundations of Newton’s Alchemy of 1975, and Westfall built upon her work in his own magisterial biography of Newton, Never at Rest , of 1980. The relative lack of subsequent scholarly grappling with Newton’s alchemy is regrettable for many reasons, particularly since Dobbs and Westfall raised strong claims about the relationship of his alchemical endeavors to his physics. In particular, they suggested that Newton’s concept of force at a distance was strongly influenced by his work in alchemy, and that it was alchemy above all else that weaned Newton away from the Cartesian mechanical universe in favor of a world governed by dynamic interactions operating both at the macrolevel and the microlevel. Although Dobbs backed away from these earlier positions in her 1991 Janus Faces of Genius, she made the equally strong claim there that Newton’s alchemy was primarily concerned with the operations of a semi-material ether that acted as God’s special agent in the material world. Westfall too emphasized the putative connections between Newton’s religion and his alchemy. _____________________________________________________________________________ 214 Digital Humanities 2008 _____________________________________________________________________________ Interestingly, the historical speculations of Westfall, Dobbs, and their popularizers have relatively little to say about the relationship of Newton’s alchemy to the alchemical tradition that he inherited. 
Perhaps because of the underdeveloped state of the historiography of alchemy in the 1970s, both Westfall and Dobbs portrayed Newton’s seemingly obsessive interest in the subject as something exotic that required the help of extradisciplinary motives to explain it. Hence Dobbs looked beyond chymistry itself, invoking the aid of Jungian psychology in her Foundations (1975) and that of Newton’s heterodox religiosity in her Janus Faces (1991). In neither case did she see chymistry as a field that might have attracted Newton on its own merits. Recent scholarship, however, has thrown a very different picture on alchemy in the seventeenth-century English speaking world.We now know that Newton’s associate Robert Boyle was a devoted practitioner of the aurific art, and that the most influential chymical writer of the later seventeenth century, Eirenaeus Philalethes, was actually the Harvard-trained experimentalist George Starkey, who tutored Boyle in chymistry during the early 1650s. Boyle’s “literary executor,” the empiricist philosopher John Locke, was also deeply involved in chrysopoetic chymistry, and even Newton’s great mathematical rival, Leibniz, had an abiding interest in the subject. In short, it was the norm rather than the exception for serious scientific thinkers of the seventeenth century to engage themselves with chymistry.We need no longer express amazement at the involvement of Newton in the aurific art, but should ask, rather, how he interacted with the practices and beliefs of his predecessors and peers in the burgeoning field of seventeenth-century chymistry. A first step in this endeavor, obviously, lies in sorting out Newton’s chymical papers and making them available, along with appropriate digital scholarly tools, for systematic analysis. The most recent phase of the project focuses on digital tools, including a digital reference work based on Newton’s Index Chemicus, an index, with bibliographic references, to the field and literature of alchemy. Newton’s Index includes 126 pages with 920 entries, including detailed glosses and bibliographic references. Our online reference work edition of Newton’s Index will not be constrained by Newton’s original structure and will extend functionality found in traditional print-based reference works. It will leverage information technologies such as searching and cross-linking to serve as an access point to the domain of seventeenth-century alchemy and a portal to the larger collection and to visualizations that graphically plot the relative occurrences of alchemical terms, bibliographic references, and other features of the collection. Other tools being developed include systems for user annotation of XMLencoded texts and facsimile page images. My presentation will look at current progress, developments, and challenges of the Chymistry of Isaac Newton. With this project as context, I will venture into more theoretical areas and examine parallels between alchemy and digital humanities. For example, like digital humanities, alchemy was an inherently interdisciplinary field. Bruce T. Moran, in his Distilling Knowlege describes alchemy as an activity “responding to nature so as to make things happen without necessarily having the proven answer for why they happen” (10). This approach has a counterpart in the playful experimentation and affection for serendipitious discovery found in much digital humanities work. 
Moran also explains that the “primary procedure” of alchemy was distillation, the principal purpose of which was to “make the purest substance of all, something linked, it was thought, to the first stuff of creation” (11). Similarly, much of digital humanities work is a process of distillation in which visualizations or XML trees, for instance, are employed to reveal the “purest substance” of the text or data set. Codes and symbols and metadata were important parts of the alchemical discipline, as they are in digital humanities. Alchemy also had a precarious place in the curriculum. Moran, again, indicates that “although a sprinkling of interest may be found in the subject within the university, it was, as a manual art, always denied a part in the scholastic curriculum” (34). Digital Humanities is likewise often cordoned off from traditional humanities departments, and its practice and research is conducted in special research centers, newly created departments of digital humanities, or in schools of information science. The idea of the alchemist as artisan and the digital humanist as technician is another interesting parallel between the two fields. My paper will examine these and other similarities between the two fields and examine more closely the works of individual alchemists, scientists, and artists in this comparative context. Alchemy is an interdisciplinary field, like much digital humanities work, that combines empirical science, philosophy, spirituality, literature, myth and magic, tools and totems. I will argue that, as we escape the caricature of alchemy as a pseudo-science preoccupied with the occult, alchemy can serve as a useful model for future directions in digital humanities. [1] As William R. Newman and Lawrence M. Principe have argued in several co-authored publications, it is anachronistic to distinguish “alchemy” from “chemistry” in the seventeenth century. The attempt to transmute metals was a normal pursuit carried out by most of those who were engaged in the varied realm of iatrochemistry, scientific metallurgy, and chemical technology. The fact that Newton, Boyle, Locke, and other celebrated natural philosophers were engaged in chrysopoeia is no aberration by seventeenth-century standards. Hence Newman and Principe have adopted the inclusive term “chymistry,” an actor’s category employed during the seventeenth century, to describe this overarching discipline. See William R. Newman and Lawrence M. Principe, “Alchemy vs. Chemistry: The Etymological Origins of a Historiographic Mistake,” Early Science and Medicine 3 (1998), pp. 32-65; Principe and Newman, “Some Problems with the Historiography of Alchemy,” in William R. Newman and Anthony Grafton, Secrets of Nature: Astrology and Alchemy in Early Modern Europe (Cambridge, MA: MIT Press, 2001), pp. 385-431. References Dobbs, Betty Jo Teeter. The Foundations of Newton’s Alchemy or, “The Hunting of the Greene Lyon”. Cambridge: Cambridge UP, 1975. ---. The Janus Faces of Genius: The Role of Alchemy in Newton’s Thought. Cambridge: Cambridge UP, 1991. Gleick, James. Isaac Newton. New York: Pantheon, 2003. Moran, Bruce T. Distilling Knowledge: Alchemy, Chemistry, and the Scientific Revolution. Cambridge: Harvard UP, 2005. Newman, William R. Promethean Ambitions: Alchemy and the Quest to Perfect Nature. Chicago: U of Chicago P, 2004.
Newton, Isaac. The Chymistry of Isaac Newton. Ed. William R. Newman. 29 October 2007. Library Electronic Text Resource Service / Digital Library Program, Indiana U. 25 November 2007. <http://www.chymistry.org/>. Westfall, Richard. Never at Rest: A Biography of Isaac Newton. Cambridge: Cambridge UP, 1980. White, Michael. Isaac Newton: The Last Sorcerer. Reading: Addison-Wesley, 1997. Document-Centric Framework for Navigating Texts Online, or, the Intersection of the Text Encoding Initiative and the Metadata Encoding and Transmission Standard John A. Walsh jawalsh@indiana.edu Indiana University, USA Michelle Dalmau mdalmau@indiana.edu Indiana University, USA Electronic text projects range in complexity from simple collections of page images to bundles of page images and transcribed text from multiple versions or editions (Haarhoff, Porter). To facilitate navigation within texts and across texts, an increasing number of digital initiatives, such as the Princeton University Library Digital Collections (http://diglib.princeton.edu/) and the Oxford Digital Library (http://www.odl.ox.ac.uk/), and cultural heritage organizations, such as Culturenet Cymru (http://www.culturenetcymru.com) and the Library of Congress (http://www.loc.gov/index.html), are relying on the complementary strengths of open standards such as the Text Encoding Initiative (TEI) and the Metadata Encoding and Transmission Standard (METS) for describing, representing, and delivering texts online. The core purpose of METS is to record, in a machine-readable form, the metadata and structure of a digital object, including files that comprise or are related to the object itself. As such, the standard is a useful tool for managing and preserving digital library and humanities resources (Cantara; Semple). According to Morgan Cundiff, Senior Standards Specialist for the Library of Congress, the maintenance organization for the METS standard: METS is an XML schema designed for the purpose of creating XML document instances that express the hierarchical structure of digital library objects, the names and locations of the files that comprise those digital objects, and the associated descriptive and administrative metadata. (53) Similarly, the TEI standard was developed to capture both semantic (e.g., metadata) and syntactic (e.g., structural characteristics) features of a document in machine-readable form that promotes interoperability, searchability and textual analysis. The Text Encoding Initiative web site defines the TEI standard as follows: The TEI Guidelines for Electronic Text Encoding and Interchange define and document a markup language for representing the structural, renditional, and conceptual features of texts. They focus (though not exclusively) on the encoding of documents in the humanities and social sciences, and in particular on the representation of primary source materials for research and analysis. These guidelines are expressed as a modular, extensible XML schema, accompanied by detailed documentation, and are published under an open-source license. (“TEI Guidelines”) METS, as its name suggests, is focused more exclusively on metadata.
While digital objects—such as a text, image, or video—may be embedded within a METS document, METS does not provide guidelines, elements and attributes for representing the digital object itself; rather, the aim of METS is to describe metadata about a digital object and the relationships among an object’s constituent parts. In sum, METS is data-centric; TEI is document-centric. In April 2006, the Indiana University Digital Library Program released a beta version of METS Navigator (http://metsnavigator.sourceforge.net/), a METS-based, open source software solution for the discovery and display of multi-part digital objects. Using the information in the METS structural map elements, METS Navigator builds a hierarchical menu that allows users to navigate to specific sections of a document, such as the title page, specific chapters, illustrations, etc., for a book. METS Navigator also allows simple navigation to the next, previous, first, and last page image or component parts of a digital object. METS Navigator can also make use of the descriptive metadata in the METS document to populate the interface with bibliographic and descriptive information about the digital object. METS Navigator was initially developed at Indiana University (IU) for the online display and navigation of brittle books digitized by the IU Libraries’ E. Lingle Craig Preservation Laboratory. However, realizing the need for such a tool across a wide range of digital library projects and applications, we designed the system to be generalizable and configurable. To assist with the use of METS Navigator for new projects, a METS profile, also expressed in XML, is registered with the Library of Congress (http://www.loc.gov/standards/mets/profiles/00000014.html). The profile provides detailed documentation about the structure of the METS documents required by the METS Navigator application. The TEI, much like METS, provides a rich vocabulary and framework for encoding descriptive and structural metadata for a variety of documents. The descriptive metadata typically found in the TEI Header may be used to populate the corresponding components (descriptive metadata section) of a METS document. The embedded structural metadata that describe divisions, sections, headings and page breaks in a TEI document may be used to generate the structural map section of a METS document. Unlike the TEI, the METS scheme has explicit mechanisms in place for expressing relationships between multiple representations of the digital content, such as encoded text files and page images. By integrating the TEI-METS workflow into METS Navigator, scholars and digital library and humanities programs can more easily implement online text collections. Further, the intersection between TEI and METS documents can provide the foundation for enhanced end-user exploration of electronic texts. The IU Digital Library Program, as a result of enhancing the functionality and modularity of the METS Navigator software, is also in the process of formalizing and integrating the TEI-METS workflow in support of online page turning. Our paper will trace the development of METS Navigator including the TEI-METS workflow, demonstrate the METS Navigator system using TEI-cum-METS documents, review METS Navigator configuration options, detail findings from recent user studies, and outline plans for current and future development of the METS Navigator.
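The mapping described above, from TEI structural markup (divisions and their heads) to a METS structural map, can be sketched in a few lines of code. The following is a minimal, hypothetical illustration in Python, not the actual METS Navigator or IU Digital Library Program workflow; the TYPE and LABEL values and the sample file name example_tei.xml are simplifying assumptions, and a real METS document would also carry file pointers and descriptive metadata sections.

```python
# Minimal sketch: derive a METS <structMap> from the <div> hierarchy of a TEI file.
# Illustrative only; attribute choices and file names are assumptions.
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
METS_NS = "http://www.loc.gov/METS/"

def tei(tag):
    return f"{{{TEI_NS}}}{tag}"

def mets(tag):
    return f"{{{METS_NS}}}{tag}"

def tei_div_to_mets_div(div):
    """Map one TEI <div> (and its nested <div>s) to a METS <div>."""
    head = div.find(tei("head"))
    label = head.text.strip() if head is not None and head.text else ""
    mets_div = ET.Element(mets("div"), {"TYPE": div.get("type", "section"), "LABEL": label})
    for child in div.findall(tei("div")):
        mets_div.append(tei_div_to_mets_div(child))
    return mets_div

def build_structmap(tei_path):
    """Build a logical METS <structMap> from the top-level <div>s in the TEI <body>."""
    body = ET.parse(tei_path).getroot().find(f".//{tei('body')}")
    struct_map = ET.Element(mets("structMap"), {"TYPE": "logical"})
    root_div = ET.SubElement(struct_map, mets("div"), {"TYPE": "text"})
    for div in body.findall(tei("div")):
        root_div.append(tei_div_to_mets_div(div))
    return struct_map

if __name__ == "__main__":
    ET.register_namespace("mets", METS_NS)
    print(ET.tostring(build_structmap("example_tei.xml"), encoding="unicode"))
```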
Cantara reported in her brief article, “The Text-Encoding Initiative: Part 2,” that discussion of the relationship between the TEI and METS was a primary focus during the Fourth Annual TEI Consortium Members’ Meeting held in 2004 at Johns Hopkins University (110). The relationship between the standards was also a topic of discussion during the 2007 Members’ Meeting, and was specifically raised in Fotis Jannidis’ plenary entitled “TEI in a Crystal Ball.” Despite these discussions, the community still lacks a well-documented workflow for the derivation of METS documents from authoritative TEI files. We believe the TEI, if properly structured, can be used as the “master” source of information from which a full METS document can be automatically generated, facilitating the display of text collections in METS Navigator. References Cantara, Linda. “Long-term Preservation of Digital Humanities Scholarship.” OCLC Systems & Services 22.1 (2006): 38-42. Cantara, Linda. “The Text-encoding Initiative: Part 2.” OCLC Systems & Services 21.2 (2005): 110-113. Cundiff, Morgan V. “An Introduction to the Metadata Encoding and Transmission Standard (METS).” Library Hi Tech 22.1 (2004): 52-64. Haarhoff, Leith. “Books from the Past: E-books Project at Culturenet Cymru.” Program: Electronic Library and Information Systems 39.1 (2004): 50-61. Porter, Dorothy. “Using METS to Document Metadata for Image-Based Electronic Editions.” Joint International Conference of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing, June 2004. Göteborg: Centre for Humanities Computing (2004): 102-104. Abstract. 18 November 2007 <http://www.hum.gu.se/allcach2004/AP/html/prop120.html>. Semple, Najla. “Developing a Digital Preservation Strategy at Edinburgh University Library.” VINE: The Journal of Information and Knowledge Management Systems 34.1 (2004): 33-37. “Standards, METS.” The Library of Congress. 18 November 2007 <http://www.loc.gov/standards/mets/>. “TEI Guidelines.” The Text Encoding Initiative. 18 November 2007 <http://www.tei-c.org/Guidelines/>. iTrench: A Study of the Use of Information Technology in Field Archaeology Claire Warwick c.warwick@ucl.ac.uk University College London, UK Melissa Terras m.terras@ucl.ac.uk University College London, UK Claire Fisher c.fischer@ucl.ac.uk University College London, UK Introduction This paper presents the results of a study by the VERA project (Virtual Research Environment for Archaeology: http://vera.rdg.ac.uk), which aimed to investigate how archaeologists use information technology (IT) in the context of a field excavation. This study was undertaken by researchers at the School of Library, Archive and Information Studies, University College London, who collaborate on the project with the School of Systems Engineering and the Department of Archaeology, University of Reading. VERA is funded by the JISC Virtual Research Environments Programme, Phase 2 (http://www.jisc.ac.uk/whatwedo/programmes/vre2.aspx), and runs from April 2007 until March 2009. We aim to produce a fully operational virtual research environment for the archaeological community. Our work is based on a research excavation of part of the large Roman town at Silchester, which aims to trace the site’s development from its origins before the Roman Conquest to its abandonment in the fifth century A.D. (Clarke et al. 2007).
This complex urban site provides the material to populate the research environment, utilising the Integrated Archaeological Data Base (IADB: http://www.iadb.co.uk/specialist/it.htm), an online database system for managing the recording, analysis, archiving and online publication of archaeological finds, contexts and plans. The dig allows us to: study the use of advanced IT in an archaeological context; investigate the tasks carried out within archaeological excavations; ascertain how and where technology can be used to facilitate information flow within a dig; and inform the designers of the IADB how it may be adapted to allow integrated use of the tools in the trench itself. Research Context Although archaeologists were quick to embrace IT to aid in research analysis and outputs (Laflin 1982, Ross et al 1990, Reilly and Rahtz 1992), and the use of IT is now central to the manipulation and display of archaeological data (Lock and Brown 2000, McPherron and Dibble 2002, Lock 2003), the use of IT to aid field archaeology is in its relative infancy due to the physical characteristics of archaeological sites and the difficulties of using IT in the outdoor environment. Whilst the use of electronic surveying equipment (total stations (Eiteljorg, 1994)) and digital cameras is now commonplace on archaeological sites, there are relatively few archaeological organizations that use digital recording methods to replace the traditional paper records, which rely on manual data input at a (sometimes much) later date. With ever-increasing amounts of data being generated by excavations, on-site databases are becoming increasingly necessary, for example at the excavations at Catalhoyuk in Turkey (http://www.catalhoyuk.com/database/catal/) and the site of Terminal 5 at Heathrow airport (http://www.framearch.co.uk/t5/), and some archaeologists have begun to use digital data input from total stations, PDAs, tablet PCs, digital cameras, digital callipers, digital pens and barcodes (McPherron and Dibble, 2003, Dibble et al., 2007). For many excavations, however, the use of IT is restricted to the analysis stage rather than field recording. The aim of the VERA project is to investigate the use of IT within the context of a field excavation and to ascertain if, and how, it may be appropriated to speed up the process of data recording, entry and access. Method We used a diary study to gather information about the work patterns of different archaeological roles and the way that they are supported by both digital and analogue technologies. The study was carried out by the UCL team at the Silchester dig during the summer of 2007 (http://www.rdg.ac.uk/AcaDepts/la/silchester/publish/field/index.php). Researchers from Reading also carried out a study into the use of ICT hardware to support digging and data entry. A detailed record of the progress of both the dig and the study was kept on the VERA blog (http://vera.rdg.ac.uk/blog). Diary studies enable researchers to understand how people usually work and can be used to identify areas that might be improved by the adoption of new working practices or technologies (O’Hara et al. 1998). They have been used in the area of student use of IT, and to study the work of humanities scholars (Rimmer et al.,
Forthcoming). However, this is the first use of this method to study field archaeology that we are aware of. During diary studies, participants are asked to keep a detailed record of their work over a short period of time. The participant records the activity that they are undertaking, what technologies they are using and any comments they have on problems or the progress of their work. This helps us to understand the patterns of behaviour that archaeologists exhibit, and how technology can support these behaviours. We also obtained contextual data about participants using a simple questionnaire. This elicited information about the diary survey participants (role, team, status) and their experience of using the technology on site. A cross-section of people representing different types of work and levels of experience was chosen. For example, we included inexperienced and experienced excavators; members of the finds team, who process the discoveries made on site; those who produce plans of the site; and visitor centre staff. A defined area of the Silchester site was used to test the use of new technologies to support excavation. In this area archaeologists used digital pens and paper (http://www.logitech.com/index.cfm/mice_pointers/digital_pen/devices/408&cl=us,en), digital cameras, and Nokia N800 PDAs (http://www.nokia.co.uk/A4305204). Diaries from this area were compared to those using traditional printed context sheets to record their work. Findings This paper will present the findings from the study, covering various issues such as the attitude of archaeologists to diary studies, previous and present experiences of using technology on research excavations, the effect of familiarity of technology on uptake and use, and resistance and concerns regarding the use of technology within an archaeological dig. We also evaluate specific technologies for this purpose, such as the Nokia N800 PDA and Logitech Digital Pens, and ascertain how IT can fit into existing workflow models to aid archaeologists in tracking information alongside their very physical task. Future work This year’s diary study supplied us with much interesting data about the past, current and potential use of IT in the trench. We will repeat the study next year to gain more detailed data. Participants will be asked to focus on a shorter period of time, one or two days, as opposed to five this year. Next year we will have a research assistant on site, allowing us to undertake interviews with participants, clarify entries, and build up a good working relationship with experts working on the excavation. Two periods of diary study will also be undertaken, allowing for analysis and refinement of methods between studies. This will also be juxtaposed with off-site user testing and analysis workshops of the IADB, to gain understanding of how archaeologists use technology both on and off site. We also plan to run an additional training session in the use of ICT hardware before the start of next year’s dig, in addition to the usual archaeological training. Conclusion The aim of this study is to feed back evidence of use of IT to the team developing the virtual environment interface of the IADB.
It is hoped that by ascertaining and understanding user needs, being able to track and trace information workflow throughout the dig, and gaining an explicit understanding of the tasks undertaken by archaeologists, more intuitive technologies can be adopted, and adapted, to meet the needs of archaeologists on site, and improve data flow within digs themselves. Acknowledgements We would like to acknowledge the input of the other members of the VERA project team, Mark Barker, Matthew Grove (SSE, University of Reading), Mike Fulford, Amanda Clarke, Emma O’Riordan (Archaeology, University of Reading), and Mike Rains (York Archaeological Trust). We would also like to thank all those who took part in the diary study. References Clarke, A., Fulford, M.G., Rains, M. and K. Tootell. (2007). Silchester Roman Town Insula IX: The Development of an Urban Property c. AD 40-50 - c. AD 250. Internet Archaeology, 21. http://intarch.ac.uk/journal/issue21/silchester_index.html Dibble, H.L., Marean, C.W. and McPherron, S.P. (2007). The use of barcodes in excavation projects: examples from Mossel Bay (South Africa) and Roc de Marsal (France). The SAA Archaeological Record 7: 33-38. Eiteljorg II, H. (1994). Using a Total Station. Centre for the Study of Architecture Newsletter, II (2), August 1994. Laflin, S. (Ed). (1982). Computer Applications in Archaeology. University of Birmingham, Centre for Computing & Computer Science. Lock, G. (2003). Using Computers in Archaeology. London: Routledge. Lock, G. and Brown, K. (Eds). (2000). On the Theory and Practice of Archaeological Computing. Oxford: Oxford University School of Archaeology. McPherron, S.P. and Dibble, H.L. (2002). Using Computers in Archaeology: A Practical Guide. Boston, MA; London: McGraw-Hill/Mayfield. See also their website http://www.oldstoneage.com/rdm/Technology.htm McPherron, S.P. and Dibble, H.L. (2003). Using Computers in Adverse Field Conditions: Tales from the Egyptian Desert. The SAA Archaeological Record 3(5): 28-32. O’Hara, K., Smith, F., Newman, W., & Sellen, A. (1998). Student readers’ use of library documents: implications for library technologies. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. Los Angeles, CA. New York, NY: ACM Press/Addison-Wesley. Reilly, P. and S. Rahtz (Eds.) (1992). Archaeology and the Information Age: A Global Perspective. London: Routledge, One World Archaeology 21. Rimmer, J., Warwick, C., Blandford, A., Buchanan, G., Gow, J. An examination of the physical and the digital qualities of humanities research. Information Processing and Management (Forthcoming). Ross, S., Moffett, J. and J. Henderson (Eds) (1990). Computing for Archaeologists. Oxford: Oxford University School of Archaeology. LogiLogi: A Webplatform for Philosophers Wybo Wiersma wybo@logilogi.org University of Groningen, The Netherlands Bruno Sarlo brunosarlo@gmail.com Overbits, Uruguay LogiLogi is a hypertext platform featuring a rating system that tries to combine the virtues of good conversations and the written word. It is intended for all those ideas that you’re unable to turn into a full-sized journal paper, but that you deem too interesting to leave to the winds.
Its central values are openness and quality of content, and to combine these values it models peer review and other valuable social processes surrounding academic writing (in line with Bruno Latour). Contrary to early web systems, it does not make use of forum threads (avoiding their many problems), but of tags and links that can also be added to articles by people other than the original author. Regardless of our project, the web is still a very young medium, and bound to make a change for philosophy in the long run. Introduction The growth of the web has been rather invisible for philosophy so far, and while quite some philosophizing has been done about what the web could mean for the human condition, not much has yet been said about what it could mean for philosophy itself (ifb; Nel93; Lev97, mainly). An exception is some early enthusiasm for newsgroups and forums in the nineties, but that quickly died out when it became apparent that those were not suitable at all for in-depth philosophical conversations. The web as a medium, however, is more than these two examples of early web systems, and in the meantime it has further matured with what some call Web 2.0, or social software (sites like MySpace, Del.icio.us and Wikipedia). Time for a second look... LogiLogi Manta (Log), the new version of LogiLogi, is a webplatform that hopes to—be it informally and experimentally—allow philosophers and people who are interested in philosophy to use the possibilities that the internet has in stock for them too. It was started with a very small grant from the department of Philosophy of the Rijksuniversiteit Groningen. It is Free Software, has been under development for almost 2 years, and will be online by June 2008. In the following section we explain what LogiLogi is, and in section 3 LogiLogi and the web as a new medium are embedded in the philosophical tradition. Optionally, section 2 can be read only after you have become interested by reading section 3. A Webplatform for Philosophers LogiLogi is an easy-to-use hypertext platform, also featuring a rating and review system which is somewhat comparable to that found in journals. It tries to find the middle road between the written word and a good conversation, and its central values are openness and quality of content. It makes commenting on texts, and more generally the linking of texts, very easy. Most notably it also allows people other than the original author of an article to add outgoing links behind words, but it does not allow them to change the text itself, so the author’s intellectual responsibility is guarded. Also important is that all conversations on the platform run via links (comparable to footnotes), not via forum threads, avoiding their associated problems like fragmentation and shallowing of the discussion. To maximize the advantages of hypertext, texts are kept short within LogiLogi, at maximum one to a few pages. They can be informal and experimental and they can be improved later on, in either of two ways: the text of the original document can be changed (earlier versions are then archived), or, secondly, links can be added inside the text, possibly only when some terms or concepts appear to be ambiguous, when questions arise, or when the text appears to arouse enough interest to make it worth further elaboration. Links in LogiLogi can refer to documents, to versions, and—by default—to tags (words that function as categories or concepts). Articles can be tagged with one or more of these tags.
Multiple articles can have the same tag, and when a link is made to a tag or to a collection of tags, multiple articles can be in the set referred to. From this set the article with the highest rating is shown to the user. In essence, one can rate the articles of others by giving them a grade. The average of these grades forms the rating of the article, but this average is a weighted average: voting powers can vary. If an author’s contributions are rated well, he receives more voting power. Authors can thus gain ‘status’ and ‘influence’ through their work. This makes LogiLogi a peer-reviewed meritocracy, quite comparable to what we, according to Bruno Latour’s philosophy of science, encounter in the various structures surrounding journals (Lat87). Most notably this quality control by peer review, and its accompanying social encouragement, was missing from earlier web systems. But the comparison goes further, and in a similar fashion to how new peergroups can emerge around new journals, in LogiLogi too new peergroups can be created by duplicating the just described rating system. Contributions can be rated from the viewpoints of different peergroups, and therefore an article can have multiple ratings, authors won’t have the same voting power within each peergroup, and visitors can pick which peergroup to use as their filter. Thus, besides being meritocratic, LogiLogi is also open to a diversity of schools and paradigms in the sense of the early Thomas Kuhn (Kuh96), especially as here creating new peergroups—unlike for journals—does not bring start-up costs. Plato, Free Software and Postmodernism The web is a relatively new medium, and new media are usually interpreted wrongly—in terms of old media. This has been called the horseless carriage syndrome (McL01), according to which a car is a carriage without a horse, film records theater plays, and—most recently—the web enables the downloading of journals. Even Plato was not exempt from this. In Phaedrus he stated that true philosophy is only possible verbally, and that writing was just an aid to memory. Regardless of this, ironically enough, his ‘memory aid’ unleashed a long philosophical tradition (dM05). New media take their time. And we should not forget that the web is still very young (1991). Also, the web is especially relevant for philosophy in that it combines conversation and writing, the two classical media of philosophy. And where previous mass media like TV and radio were not suitable for philosophy, this was because they were one-to-many, and thus favored the factory model of culture (Ado91). The web, on the other hand, is many-to-many, and thereby enables something called peer-to-peer production (Ben06). An early example of this is Free Software: without much coordination, tens of thousands of volunteers have created software of the highest quality, like Linux and Firefox. Eric Raymond (Ray99) described this as a move from the cathedral to the bazaar model of software development. The cathedral model has a single architect who is responsible for the grand design, while in the bazaar model the design evolves from collective contributions. This bazaar model is not unique to the web. It shares much with the academic tradition. The move from the book to the journal can be compared with a move in the direction of a bazaar model.
Other similarities are decentralized operation and peer review. The only new thing about the Free Software example was its use of the web, which—through its shorter turnaround times—is very suitable for peer-to-peer production. Another development that LogiLogi follows closely is one within philosophy itself: Jean-François Lyotard, in his La Condition Postmoderne, proclaimed the end of great stories (Lyo79). Instead he saw a diversity of small stories, each competing with others in their own domains. Also Derrida spoke of the materiality of texts, where texts and intertextuality gave meaning instead of ‘pure’ ideas (Ber79; Nor87). The web in this sense is a radicalisation of postmodernism, allowing for even more and easier intertextuality. And instead of trying to undo the proliferation of paradigms, as some logic-advocates tried, and still try, we think the breakdown of language—as in further segmentation—is here to stay, and even a good thing, because it reduces complexity in the sense of Niklas Luhmann (Blo97). Take human intelligence as fixed and you see that specialized (or ‘curved’ as in curved space) language allows for a more precise analysis. LogiLogi thus is explicitly modeled to allow for fine-grained specialization, and for a careful definition and discussion of terms in context. Conclusion To reiterate: LogiLogi will offer an easy-to-use hypertext environment, and thanks to its rating system a combination of quality and openness will be achieved: everyone can contribute, and even start new peergroups, but within peergroups quality is the determining factor. LogiLogi thus combines the informal, incremental and interactive qualities of good conversations with conservation over time and space, as we traditionally know it from the written word. LogiLogi is still very experimental. Nevertheless, what we can be sure about is that the web, as a medium that has proven to be very suitable for peer-to-peer production and that promises increased intertextuality and differentiation of language, is bound to make a change for philosophy in the long run, with or without LogiLogi. References [Ado91] Theodor Adorno. Culture industry reconsidered. In Theodor Adorno, editor, The Culture Industry: Selected Essays on Mass Culture, pages 98–106. Routledge, London, 1991. [Ben06] Yochai Benkler. The Wealth of Networks. Yale University Press, London, 2006. [Ber79] Egide Berns. Denken in Parijs: taal en Lacan, Foucault, Althusser, Derrida. Samsom, Alpen aan den Rijn, 1979. [Blo97] Christiaan Blom. Complexiteit en Contingentie: een kritische inleiding tot de sociologie van Niklas Luhmann. Kok Agora, Kampen, 1997. [dM05] Jos de Mul. Cyberspace Odyssee. Klement, Kampen, 2005. [ifb] http://www.futureofthebook.org. The Institute for the Future of the Book, MacArthur Foundation, University of Southern California. [Kuh96] Thomas Kuhn. The Structure of Scientific Revolutions. University of Chicago Press, Chicago, 1996. [Lat87] Bruno Latour. Science in Action. Open University Press, Cambridge, 1987. [Lev97] Pierre Levy. Collective Intelligence: Mankind’s Emerging World in Cyberspace. Plenum Press, New York, 1997. [Log] http://foundation.logilogi.org. LogiLogi & The LogiLogi Foundation. [Lyo79] Jean-François Lyotard. La condition postmoderne: rapport sur le savoir. Les Éditions de Minuit, Paris, 1979. [McL01] Marshall McLuhan.
Understanding Media: The Extensions of Man. Routledge, London, 2001. [Nel93] Ted Nelson. Literary Machines: The report on, and of, Project Xanadu concerning word processing, electronic publishing, hypertext, thinkertoys, tomorrow’s intellectual... including knowledge, education and freedom. Mindful Press, Sausalito, California, 1993. [Nor87] Christopher Norris. Derrida. Fontana Press, London, 1987. [Ray99] Eric S. Raymond. The cathedral & the bazaar: musings on Linux and open source by an accidental revolutionary. O’Reilly, Beijing, 1999. Clinical Applications of Computer-assisted Textual Analysis: A TEI Dream? Marco Zanasi marco.zanasi@uniroma2.it Università di Roma Tor Vergata, Italy Daniele Silvi d.silvi@email.it, silvi@lettere.uniroma2.it Università di Roma Tor Vergata, Italy Sergio Pizziconi s.p@email.it Università per Stranieri di Siena, Italy Giulia Musolino arcangelo.abbatemarco@fastwebnet.it Università di Roma Tor Vergata, Italy This research is the result of the collaboration of the Departments of Literature and Medicine of the University of Rome “Tor Vergata”. Marco Zanasi’s idea of coding and studying the dream reports of his patients at Tor Vergata University Hospital Psychiatric Unit started a team research project on a challenging hypothesis: is there any correlation between the linguistic realization of dream reports and the psychopathology from which the dreamer is suffering? So far, analyses of dream reports have focused mainly on actants, settings, and descriptors of emotional condition during the oneiric activity. The goal of “Dream Coding” is to collect, transcribe, catalogue, file, study, and comment on dreams reported by selected patients at Tor Vergata University Hospital in Rome. In the group of observed patients, different psychopathological statuses are represented. The aim of the project is to build up a code which would enable a different reading of the reports that patients make of their dreams, a code that in the future could also provide new tools for initial and ongoing diagnoses of the psychopathology affecting the patients. We then want to verify a set of linguistic features that can be significantly correlated to the type of psychopathology on a statistical basis. Numerous aspects, both theoretical and methodological, are discussed in this paper, ranging from the nature of the variation of language to be investigated to the tag set to be used in corpus preparation for computer analysis. The first issue that the analysis has to solve is the speech transcription and how to obtain an accurate transcoded text from the oral form. After having considered several working hypotheses, we have decided to use a “servile transcription” of the speech: recording the narrators’ voices and transferring them into an electronic format. At an early stage, we realized that some important elements of the speech could get lost in this process. This is why we have decided to reproduce the interruptions and hesitations. Interruptions have been classified on the basis of their length. Starting from this consideration, a form (enclosed) has been created in order to catalogue the dreams and other data, such as age, sex, address, birthday, job, and so on. Furthermore, the form contains information related to the transcriber, so as to detect the “noise” added by them.
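The length-based classification of interruptions mentioned above can be illustrated with a small sketch. This is a hypothetical rendering in Python, not the project's actual mark-up scheme: the three duration bands follow the "Scale of Silence" described later in this abstract, while the tag and attribute names are invented for illustration.

```python
# Hypothetical sketch: classify a pause by length and emit an illustrative tag.
# Duration bands (0-2 s, 2-4 s, 4 s or more) follow the "Scale of Silence";
# the <pause> element and its attributes are assumptions, not the project's scheme.
def interruption_tag(seconds: float) -> str:
    if seconds < 2:
        band = "short"   # 0-2 sec
    elif seconds < 4:
        band = "medium"  # 2-4 sec
    else:
        band = "long"    # 4 sec or more
    return f'<pause dur="{seconds:.1f}" band="{band}"/>'

def annotate(transcript: str, pauses: list[tuple[int, float]]) -> str:
    """Insert pause tags into a transcript, given (character offset, duration) pairs."""
    out, last = [], 0
    for position, duration in sorted(pauses):
        out.append(transcript[last:position])
        out.append(interruption_tag(duration))
        last = position
    out.append(transcript[last:])
    return "".join(out)

print(annotate("I was walking in a house", [(13, 2.5)]))
# -> I was walking<pause dur="2.5" band="medium"/> in a house
```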
The noise can be partially isolated by noticing the continuity of some elements. The data obtained are stored on the computer as word-processor files (.rtf format) to be analyzed later on. The working process has been divided into four steps, namely: 1. first step: we collect the data of the patients who already have a diagnosis; 2. second step: we analyze the dreams collected in the previous phase and point out recurrent lexical structures; 3. third step: we collect data from all the other patients in order to analyze them and propose a kind of diagnosis; 4. fourth step: we compare the results for the first group of patients with those for the second, and try to create a “vademecum” of the particularities noticed from a linguistic point of view. We call this classification “textual diagnosis”, to be compared with the “medical diagnosis”, in order to verify correspondences and incongruities. Although our research is at a preliminary stage, the statistical results on text analysis obtained so far allow us to make a few considerations. The first noticeable aspect is that differences between the dreams of psychopathologic patients and controls are observed, which suggests the existence of specific features most likely correlated to the illness. We also observe a scale of narration complexity progressively decreasing from psychotic to bipolar patients and to controls. Since the patients recruited are in a remission phase, we may hypothesize that the variations observed are an expression of the underlying pathology and thus not due to delusions, hallucinations, hypomania, mania and depression. Under the notion of “text architecture” we have included items such as coherence and cohesion and the anaphoric chains. The complete quantitative results will soon be available, but it is possible to show at least the set of variables for which the texts will be surveyed, in order to account for the lexical variety we registered. The syntactic and semantic opposition of NEW and GIVEN information is the basis according to which both formal and content items will be annotated. Therefore, as an example of the connection between formal and content items, the use of definite or indefinite articles will be linked to the introduction of the following new narrative elements: • a new character is introduced in the plot, be it a human being or an animal; • a shift to a character already in the plot; • a new object appears in the scene; • a shift to an object already in the plot; • a shift of location without a verb group expressing the transfer. The difficulty of recognizing any shift of time in the story without the appropriate time adverb or expression in the discourse made it not relevant to register this variation, since it would necessarily be correlated with the correct formal element. The non-coherent and often non-cohesive structure of many patients’ dreams proves relevant in determining what the ratios indicate as a richer lexicon. The consistent, abrupt introduction of NEW pieces of information, especially when correlated with the formal features, is likely to be one of the most significant differences between the control persons and patients, as well as between persons of the first two large psychopathologic categories analyzed here, bipolar disorders and psychotic disorders.
Our research is an ongoing project that aims to enlarge the samples and evaluate the possible influences of pharmacological treatments on the oneiric characteristics being analyzed. We have elaborated a measurement system for interruptions that we call the “Scale of Silence”; it provides several special mark-ups for the interruptions during the speech. The scale (which is in a testing phase) has three types of tags according to the length of the interruption (0-2 sec; 2-4 sec; 4 or more). Special tags are also used for Freudian slips and the self-corrections the narrator makes. For such an accurate analysis the research group has started marking up the texts of the dream reports. The mark-up will allow computer-aided analyses of semantics, syntax, rhetorical organization, metaphorical references and text architecture. The Chinese version of JGAAP performs Chinese word segmentation first, followed by a feature selection process at word level, as preparation for a later analytic phase. After getting a set of ordered feature vectors, we then use different analytical methods to produce authorship judgements. Unfortunately, the errors introduced by the segmentation method(s) will almost certainly influence the final outcome, creating a need for testing. Almost all methods for Chinese word segmentation developed so far are either structural (Wang et al., 1991) or statistically based (Lua, 1990). A structural algorithm resolves segmentation ambiguities by examining the structural relationship between words, while a statistical algorithm usually compares word frequency and character co-occurrence probability to detect word boundaries. The difficulties in this study are ambiguity resolution and novel word detection (personal names, company names, and so on). We use a combination of Maximum Matching and conditional probabilities to minimize this error. Maximum matching (Liu et al., 1994) is one of the most popular structural segmentation algorithms; matching from the beginning of the text to the end is called Forward Maximal Matching (FMM), and matching in the other direction is called Backward Maximal Matching (BMM). A large lexicon that contains all the possible words in Chinese is usually used to find segmentation candidates for input sentences. Here we need a lexicon that not only has general words but also contains as many personal names, company names, and organization names as possible for detecting new words. Before we scan the text we apply certain rules to divide the input sentence into small linguistic blocks, such as separating the document by English letters, numbers, and punctuation, giving us small pieces of character strings. The segmentation then starts from both directions of these small character strings. The major resource of our segmentation system is this large lexicon. We compare these linguistic blocks with the words in the lexicon to find the possible word strings. If a match is found, one word is segmented successfully. We do this in both directions; if the results are the same, then the segmentation is accomplished. If not, we take the one that has fewer words. If the number of words is the same, we take the result of BMM as our result. As an example: suppose ABCDEFGH is a character string, and our lexicon contains the entries A, AB, ABC, but not ABCD.
For FMM, we start from the beginning of the string (A). If A is found in the lexicon, we then look for AB in the lexicon. If AB is also found, we look for ABC, and so on, until the string is not found. For example, ABCD is not found in the lexicon, so we consider ABC as a word, and then we start from character D until the end of this character string. BMM works in just the opposite direction, starting with H, then GH, then FGH, and so forth. Suppose the segmentation we get from FMM is (a) A \ B \ CD \ EFG \ H and the segmentation from BMM is (b) A \ B \ C \ DE \ FG \ H. We will take result (a), since it has fewer words. But if what we get from BMM is (c) AB \ C \ DE \ FG \ H, we will take result (c), since the number of words is the same in both methods. After the segmentation step we take advantage of JGAAP’s features and add different event sets according to the characteristics of Chinese, then apply statistical analysis to determine the final results. It is not clear at this writing, for example, if the larger character set of Chinese will make character-based methods more effective in Chinese than they are in other languages written with the Latin alphabet (like English). It is also not clear whether the segmentation process will produce the same kind of useful “function words” that are so valuable in English authorship attribution. The JGAAP structure (Juola et al., 2006; Juola et al., submitted), however, will make it easy to test our system using a variety of different methods and analysis algorithms. In order to test the performance of our software on Chinese, we are in the process of constructing a Chinese test corpus. We will select three popular novelists and ten novels from each one; eight novels from each author will be used as training data, and the other two will be used as testing data. We will also test on blogs selected from the internet. The testing procedure will be the same as with the novels. This research demonstrates, first, that the JGAAP structure can easily be adapted to the problems of non-Latin scripts and non-English languages, and, second, provides some cues to the best practices of authorship attribution in Chinese. It can hopefully be extended to the development of other non-Latin systems for authorship attribution.
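To make the segmentation procedure concrete, here is a minimal sketch of forward and backward maximal matching with the fewer-words rule and the BMM tie-break described above. It is an illustration only, not the JGAAP implementation; the toy lexicon extends the ABC example with a few invented entries, and unknown characters simply fall through as single-character words.

```python
def fmm(text, lexicon, max_len=8):
    """Forward maximal matching: take the longest lexicon match from the left."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            # Accept a lexicon match, or fall back to a single unknown character.
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

def bmm(text, lexicon, max_len=8):
    """Backward maximal matching: the same procedure from the right end."""
    words, j = [], len(text)
    while j > 0:
        for i in range(max(0, j - max_len), j):
            if text[i:j] in lexicon or i == j - 1:
                words.insert(0, text[i:j])
                j = i
                break
    return words

def segment(text, lexicon):
    """Identical results stand; otherwise prefer fewer words, with ties going to BMM."""
    f, b = fmm(text, lexicon), bmm(text, lexicon)
    if f == b:
        return f
    return f if len(f) < len(b) else b

# Toy lexicon based on the ABCDEFGH example above (extra entries are invented).
print(segment("ABCDEFGH", {"A", "AB", "ABC", "CD", "DE", "EFG", "FG"}))
```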
References: Patrick Juola (in press). Authorship Attribution. Delft: NOW Publishing. Patrick Juola, John Noecker, and Mike Ryan (submitted). “JGAAP3.0: Authorship Attribution for the Rest of Us.” Submitted to DH2008. Patrick Juola, John Sofko, and Patrick Brennan (2006). “A Prototype for Authorship Attribution Studies.” Literary and Linguistic Computing 21: 169-178. Yuan Liu, Qiang Tan, and Kun Xu Shen (1994). “The Word Segmentation Rules and Automatic Word Segmentation Methods for Chinese Information Processing” (in Chinese). Qing Hua University Press and Guang Xi Science and Technology Press, page 36. K. T. Lua (1990). From Character to Word: An Application of Information Theory. Computer Processing of Chinese and Oriental Languages, Vol. 4, No. 4, pages 304-313, March. Liang-Jyh Wang, Tzusheng Pei, Wei-Chuan Li, and Lih-Ching R. Huang (1991). “A Parsing Method for Identifying Words in Mandarin Chinese Sentences.” In Proceedings of the 12th International Joint Conference on Artificial Intelligence, pages 1018-1023, Darling Harbour, Sydney, Australia, 24-30 August. Automatic Link-Detection in Encoded Archival Descriptions Junte Zhang j.zhang@uva.nl University of Amsterdam, The Netherlands Khairun Nisa Fachry k.n.fachry@uva.nl University of Amsterdam, The Netherlands Jaap Kamps j.kamps@uva.nl University of Amsterdam, The Netherlands In this paper we investigate how currently emerging link detection methods can help enrich encoded archival descriptions. We discuss link detection methods in general, and evaluate the identification of names both within, and across, archival descriptions. Our initial experiments suggest that we can automatically detect occurrences of person names with high accuracy, both within (F-score of 0.9620) and across (F-score of 1) archival descriptions. This allows us to create (pseudo) encoded archival context descriptions that provide novel means of navigation, improving access to the vast amounts of archival data. Introduction Archival finding aids are complex multi-level descriptions of the paper trails of corporations, persons and families. Currently, finding aids are increasingly encoded in XML using the Encoded Archival Description (EAD) standard. Archives can cover hundreds of meters of material, resulting in long and detailed EAD documents. We use a dataset of 2,886 EAD documents from the International Institute of Social History (IISG) and 3,119 documents from the Archives Hub, containing documents with more than 100,000 words. Navigating such archival finding aids becomes non-trivial, and it is easy to lose overview of the hierarchical structure. Hence, this may lead to the loss of important contextual information for interpreting the records. Archival context may be preserved through the use of authority records capturing information about the record creators (corporations, persons, or families) and the context of record creation. By separating the record creators’ descriptions from the records or resource descriptions themselves, we can create “links” from all occurrences of the creators to this context. The resulting descriptions of record creators can be encoded in XML using the emerging Encoded Archival Context (EAC) standard. Currently, EAC has only been applied experimentally. One of the main barriers to adoption is that it requires substantial effort to adopt EAC. The information for the creator’s authority record is usually available in some form (for example, EAD descriptions usually have a detailed field <bioghist> about the archive’s creator). However, linking such a context description to occurrences of the creator in the archival descriptions requires more structure than is available in legacy data. Our main aim is to investigate if and how automatic link detection methods could help improve archival access. Automatic link detection studies the discovery of relations between various documents. Such methods have been employed to detect “missing” links on the Web and recently in the online encyclopedia Wikipedia. Are link detection methods sufficiently effective to be fruitfully applied to archival descriptions? To answer this question, we will experiment on the detection of archival creators within and across finding aids.
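As a concrete illustration of the name detection used in the experiment below, the following minimal sketch compiles a single regular expression from a list of known name variants and reports where they occur in a text. The variants are those listed for Joop den Uyl in the next section; the sample sentence and function names are invented for illustration, and the actual README system is not reproduced here.

```python
import re

# Name variants for the archive creator, as listed in the experiment below.
VARIANTS = ["J.M. Den Uyl", "Joop M. Den Uyl", "Johannes Marten den Uyl", "Den Uyl"]

def variant_pattern(variants):
    """Build one case-insensitive regex matching any variant.
    Longer variants come first so that 'Den Uyl' does not pre-empt fuller forms."""
    ordered = sorted(variants, key=len, reverse=True)
    return re.compile("|".join(re.escape(v) for v in ordered), re.IGNORECASE)

def find_occurrences(text, variants=VARIANTS):
    """Return (offset, matched string) pairs for every occurrence in the text."""
    return [(m.start(), m.group(0)) for m in variant_pattern(variants).finditer(text)]

sample = "The archive of Johannes Marten den Uyl (J.M. Den Uyl) contains letters to Den Uyl."
print(find_occurrences(sample))
```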
Based on our findings, we will further discuss how detected links can be used to provide crucial contextual information for the interpretation of records, and to improve navigation within and across finding aids. Link Detection Methods Links generated by humans are abundant on the World Wide Web and in knowledge repositories like the online encyclopedia Wikipedia. There are two kinds of links: incoming and outgoing links. Substrings of text nodes are identified as anchors and become clickable. Incoming links come from text nodes of target files (destination node) and point to a source file (origin node), while an outgoing link goes from a text node in the source document (origin node) to a target file (destination node). Two assumptions are made: a link from document A to document B is a recommendation of B by A, and documents linked to each other are related. To automatically detect whether two nodes are connected, it is necessary to search the archives for some string that both share. Usually it is only one specific and exact string. A general approach to automatic link detection is first to detect the global similarity between documents. After the relevant set of documents has been collected, the local similarity can be detected by comparing text segments with other text segments in those files. In structured documents like archival finding aids, these text segments are often marked up as logical units, whether it be the title <titleproper>, the wrapper element <c12> deep down in the finding aid, or the element <persname> that identifies some personal names. These units are identified and retrieved in XML Element retrieval. The identification of relevant anchors is a key problem, as these are used in the system’s retrieval models to point to (parts of) related finding aids. Experiment: Name Detection markup and punctuation), of which 4,979 are unique, and where a token is a sequence of non-space characters. We collect a list of the name variants that we expect to encounter: “J.M. Den Uyl”, “Joop M. Den Uyl”, “Johannes Marten den Uyl”, “Den Uyl”, etc. We construct a regular expression to fetch the name variants. The results are depicted in Illustration 1, which shows the local view of the Joop den Uyl archive in our Retrieving EADs More Effectively (README) system. Illustration 1: Links detected in EAD The qua