The Penn Parsed Corpus of Sumerian

What is PPCS?

PPCS is a research project to build a complete parsed corpus of Sumerian in the style of the Penn Treebank, and to implement an XML syntax for treebank output so that it can be fed back into the Pennsylvania Sumerian Dictionary (PSD), providing phrase- and sentence-level parsed context for semantic analysis. The lead researcher is Fumi Karahashi; the technical lead is Steve Tinney. Tony Kroch and Beatrice Santorini have provided invaluable consultation; Philip Jones and Tonia Sharlach of the PSD are the principal creators of the Sumerian word list that the parser uses as the basis for part-of-speech categorization.

Project Goals

The initial goal is to produce an exhaustive corpus of all known Sumerian texts, analyzed at the morphological and syntactic levels, with semantic glosses of the lexical items in the corpus.

The corpus is intended to support rigorous systematic and statistical evaluation of propositions about the nature of the Sumerian language. PPCS markup is designed to augment the work of existing online corpus projects.

The corpus is being developed in phases according to this roadmap, and will be made freely available on this website in both XML and the more traditional treebank-style parenthesized format.
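
To illustrate how the two output formats relate, here is a minimal sketch that turns a treebank-style bracketed tree into XML. It is not the PPCS converter: the element name node, the attribute cat, the function names, and the labels and placeholder words in the example are all assumptions made for illustration, and the actual PPCS tagset and XML syntax may differ.

```python
import xml.etree.ElementTree as ET


def tokenize(text):
    # Pad parentheses with spaces so split() separates them from labels and words.
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens, pos=0):
    """Parse one bracketed constituent at tokens[pos]; return (element, next_pos)."""
    if pos + 1 >= len(tokens) or tokens[pos] != "(":
        raise ValueError(f"expected '(' followed by a label at token {pos}")
    # Hypothetical schema: one <node> per constituent, category in a 'cat' attribute.
    node = ET.Element("node", cat=tokens[pos + 1])
    pos += 2
    words = []
    while pos < len(tokens) and tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            node.append(child)
        else:
            words.append(tokens[pos])
            pos += 1
    if pos >= len(tokens):
        raise ValueError("unbalanced parentheses: missing ')'")
    if words:
        node.text = " ".join(words)
    return node, pos + 1


def bracketed_to_xml(text):
    """Convert a single bracketed tree to an XML string."""
    tokens = tokenize(text)
    root, end = parse(tokens)
    if end != len(tokens):
        raise ValueError("trailing material after the tree")
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Placeholder labels and words, not actual PPCS annotation.
    print(bracketed_to_xml("(IP (NP (N w1)) (VP (V w2)))"))
```

Each parenthesized constituent becomes one nested element, so the tree structure is the same in both formats and only the notation changes.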

Project Data and Documentation

The PPCS Pilot Project, funded by the University of Pennsylvania's University Research Foundation, runs from August 2003 to the end of June 2004. During this time we have:


Questions about PPCS can be directed to Fumi Karahashi or Steve Tinney.