This page gives an idea of the human and computational steps involved in going from cuneiform text to a graphical syntax tree.
Cuneiform documents are read from originals, photos or hand-copies. This is what Gudea inscription 3 looks like in hand-copy:
Documents are transliterated by humans, i.e., rendered in romanized script, normally with hyphens joining signs which belong to the same word and spaces separating words. (The PPCS transliterations input to the next processing phase are in the CDLI/PSD ATF format.) This is what Gudea inscription 3 looks like in transliteration:
@composite
&Q000108 = Gudea 3
1. {d}ba-ba6
2. munus sag9-ga
3. dumu an-na
4. nin-a-ni
5. gu3-de2-a
6. ensi2
7. lagasz{ki}-ke4
8. e2 iri-kug#-ga-ka-ni
9. mu-na-du3
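To make the conventions in the transliteration concrete, here is a minimal Python sketch that reads the lines above into words and signs. It is not part of SumKit; the function names `split_signs` and `parse_atf` are made up for illustration, and it only handles the handful of ATF features that appear in this sample (the `&` and `@` header lines, numbered text lines, hyphens, determinatives in braces, and the `#` damage flag).

```python
import re

ATF_SAMPLE = """\
@composite
&Q000108 = Gudea 3
1. {d}ba-ba6
2. munus sag9-ga
3. dumu an-na
4. nin-a-ni
5. gu3-de2-a
6. ensi2
7. lagasz{ki}-ke4
8. e2 iri-kug#-ga-ka-ni
9. mu-na-du3
"""

LINE_RE = re.compile(r"^(\d+)\.\s+(.*)$")


def split_signs(word):
    """Split one transliterated word into its signs (hypothetical helper).

    Hyphens join signs; determinatives in braces ({d}, {ki}) become
    separate items; the '#' damage flag is dropped.
    """
    word = word.replace("#", "")
    # Surround each {...} with hyphens so it splits off as its own item.
    word = re.sub(r"(\{[^}]*\})", r"-\1-", word)
    return [sign for sign in word.split("-") if sign]


def parse_atf(text):
    """Return (header, lines) for the small ATF subset used in the sample."""
    header = {}
    lines = []
    for raw in text.splitlines():
        raw = raw.strip()
        if not raw:
            continue
        if raw.startswith("&"):
            ident, _, name = raw[1:].partition("=")
            header["id"] = ident.strip()
            header["name"] = name.strip()
        elif raw.startswith("@"):
            header.setdefault("tags", []).append(raw[1:])
        else:
            match = LINE_RE.match(raw)
            if match:
                words = [split_signs(w) for w in match.group(2).split()]
                lines.append((int(match.group(1)), words))
    return header, lines


header, lines = parse_atf(ATF_SAMPLE)
for number, words in lines:
    print(number, words)
```

Keeping the determinatives as separate items is a choice made for this sketch; a real ATF reader would need to cover far more of the format.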
Sumerian writing does not represent the grammar with a simple one-sign = one-morpheme relationship. A program from SumKit does morphological segmentation, i.e., reads the transliteration and outputs transcription. Here's Gudea 3 in transcription:
1. Bau
2. munus sag;a
3. dumu An,ak
4. nin,ani
5. Gudea
6. ensik
7. Lagasz,ak.e
8. e Irikug,ak.ani
9. mu.na:du
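The transcription marks morpheme boundaries inside a word with punctuation (`,` `.` `;` `:`). This page does not say what each mark stands for, so the following Python sketch only splits words at those marks and records which mark preceded each morpheme. The helper names `words` and `split_morphemes` are invented for the example and are not SumKit functions.

```python
import re

TRANSCRIPTION = (
    "1. Bau 2. munus sag;a 3. dumu An,ak 4. nin,ani 5. Gudea "
    "6. ensik 7. Lagasz,ak.e 8. e Irikug,ak.ani 9. mu.na:du"
)


def words(text):
    """Return the words of the transcription, dropping line numbers like '7.'."""
    return [tok for tok in text.split() if not re.fullmatch(r"\d+\.", tok)]


def split_morphemes(word):
    """Split a word into (boundary_mark, morpheme) pairs.

    The first morpheme has an empty boundary mark. The meaning of the
    individual marks is deliberately left uninterpreted.
    """
    pieces = re.split(r"([,.;:])", word)  # morpheme, mark, morpheme, ...
    result = [("", pieces[0])]
    for i in range(1, len(pieces), 2):
        result.append((pieces[i], pieces[i + 1]))
    return result


for word in words(TRANSCRIPTION):
    print(word, split_morphemes(word))
```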
SumKit determines sentence boundaries by identifying finite, non-subordinate verbs based on the morphological segmentation. A web interface enables easy editing of the sentence boundaries if they need correction. The user simply checks and unchecks boxes as appropriate, and after a commit the previous parsing operations are automatically rerun using the new sentence boundaries:
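The boundary rule described above can be sketched in a few lines of Python. This is an illustration only, not SumKit's code: the `Word` class and its `finite_verb` and `subordinate` flags are hypothetical stand-ins for whatever the morphological segmentation actually provides, and the sketch assumes that a sentence ends at its finite, non-subordinate verb.

```python
from dataclasses import dataclass


@dataclass
class Word:
    form: str
    finite_verb: bool = False   # hypothetical flag from segmentation
    subordinate: bool = False   # hypothetical flag from segmentation


def split_sentences(words):
    """Close a sentence after every finite, non-subordinate verb."""
    sentences, current = [], []
    for word in words:
        current.append(word)
        if word.finite_verb and not word.subordinate:
            sentences.append(current)
            current = []
    if current:
        # Trailing words with no main verb are kept as their own group.
        sentences.append(current)
    return sentences


# Gudea 3 in transcription; only the last word is flagged as a finite verb.
GUDEA_3 = [
    Word(form)
    for form in (
        "Bau", "munus", "sag;a", "dumu", "An,ak", "nin,ani",
        "Gudea", "ensik", "Lagasz,ak.e", "e", "Irikug,ak.ani",
    )
] + [Word("mu.na:du", finite_verb=True)]

sentences = split_sentences(GUDEA_3)
print(len(sentences), [w.form for w in sentences[0]])
```

With this input the whole inscription comes out as a single sentence. A manual correction of the kind the web interface allows would amount to changing where the groups are closed.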
Disambiguation (determining which of several possible forms and words is the correct one in context) is performed by the next SumKit phase, according to rule sets written in a purpose-designed mini-language.
A further SumKit module performs phrasal parsing of the sentences. The phrases are annotated primarily with morphological category information, but some syntactic classification is also performed. Internally, the SumKit phrasal output is XML in a form based on XCES. An excerpt from the file for Gudea 3 looks like this:
<?xml version="1.0"?>
<refs xmlns="http://emegir.info/syntax-tree"
      xml:id="Q000108" n="Gudea 3">
  <ref ref="0" xml:id="Q000108.R0">
    <struct id="s0">
      <feat type="syn-cat">S</feat>
      <struct id="s1">
        <feat type="syn-cat">NP</feat>
        <feat type="ms-case">DAT</feat>
        <struct id="s2">
          <feat type="ms-cat">DN</feat>
          <seg target="x6b3917fb"/>
          <data item="Bau"/>
        </struct>
        <struct id="s3">
          <feat type="syn-cat">NP</feat>
          <feat type="parenthetic">PRN</feat>
          <struct id="s4">
            <feat type="ms-cat">N</feat>
            <seg target="x7bfcc2d6"/>
            <data item="munus">
              <gloss>woman</gloss>
            </data>
          </struct>
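For readers who want to explore this structure, here is a small Python sketch that walks the nested `struct` elements with the standard library's `xml.etree.ElementTree`. The XML excerpt shown on this page breaks off partway through, so the sample in the code closes the open elements by hand to make it well-formed; the real file contains more than this. The function name `describe` is made up for the example.

```python
import xml.etree.ElementTree as ET

NS = {"st": "http://emegir.info/syntax-tree"}

# The page's excerpt, with the open elements closed by hand (an assumption
# made only so that the sample parses).
SAMPLE = """<?xml version="1.0"?>
<refs xmlns="http://emegir.info/syntax-tree"
      xml:id="Q000108" n="Gudea 3">
  <ref ref="0" xml:id="Q000108.R0">
    <struct id="s0">
      <feat type="syn-cat">S</feat>
      <struct id="s1">
        <feat type="syn-cat">NP</feat>
        <feat type="ms-case">DAT</feat>
        <struct id="s2">
          <feat type="ms-cat">DN</feat>
          <seg target="x6b3917fb"/>
          <data item="Bau"/>
        </struct>
        <struct id="s3">
          <feat type="syn-cat">NP</feat>
          <feat type="parenthetic">PRN</feat>
          <struct id="s4">
            <feat type="ms-cat">N</feat>
            <seg target="x7bfcc2d6"/>
            <data item="munus">
              <gloss>woman</gloss>
            </data>
          </struct>
        </struct>
      </struct>
    </struct>
  </ref>
</refs>
"""


def describe(struct, depth=0):
    """Return (depth, id, category, item, gloss) rows for a struct subtree."""
    feats = {f.get("type"): f.text for f in struct.findall("st:feat", NS)}
    category = feats.get("syn-cat") or feats.get("ms-cat") or "?"
    data = struct.find("st:data", NS)
    item = data.get("item") if data is not None else None
    gloss = data.findtext("st:gloss", default=None, namespaces=NS) if data is not None else None
    rows = [(depth, struct.get("id"), category, item, gloss)]
    for child in struct.findall("st:struct", NS):
        rows.extend(describe(child, depth + 1))
    return rows


root = ET.fromstring(SAMPLE)
rows = describe(root.find("st:ref/st:struct", NS))
for depth, ident, category, item, gloss in rows:
    print("  " * depth + f"{ident} {category} {item or ''} {gloss or ''}".rstrip())
```

Because the document declares a default namespace, every element name has to be looked up with a prefix mapped to that namespace, which is what the `NS` dictionary is for.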
Fortunately, humans don't have to look at the XML above. SumKit also includes tools to convert the internal formats into display versions more suitable for human viewing. Here is the one that is used for ePSD, for example:
The default PPCS format follows the Lisp-ish parenthetic stylings typical of tree-bank projects:
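The exact PPCS layout is not reproduced in this text, so the following Python sketch only shows the general idea of a parenthetic tree-bank notation: each node is written as an opening parenthesis, a label, its children, and a closing parenthesis. The nested-tuple tree and the function `to_brackets` are invented for the example; the labels and words are taken from the XML excerpt for Gudea 3.

```python
def to_brackets(node):
    """Render a nested (label, child, ...) tuple as a bracketed string.

    A leaf is (label, word); an inner node is (label, subtree, ...).
    """
    label, rest = node[0], node[1:]
    if len(rest) == 1 and isinstance(rest[0], str):
        return f"({label} {rest[0]})"
    return "(" + label + " " + " ".join(to_brackets(child) for child in rest) + ")"


# Hypothetical tree built from the first few nodes of the XML excerpt.
TREE = ("S", ("NP", ("DN", "Bau"), ("NP", ("N", "munus"))))

print(to_brackets(TREE))
# prints: (S (NP (DN Bau) (NP (N munus))))
```

The real PPCS output may well differ in details such as line breaks, indentation and how case or other features are attached to the labels.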
The Syntax Tree Viewer/Editor, STVE, produces a real visual tree which can be navigated around and edited to fix any errors that the phrasal parsing layer makes:
Questions about PPCS can be directed to Fumi Karahashi or Steve Tinney.