Greek is an independent branch of the Indo-European family of languages, native to Greece and other parts of the Eastern Mediterranean. It has the longest documented history of any living language, spanning 34 centuries of written records. Its writing system has been the Greek alphabet for the major part of its history; other systems, such as Linear B and the Cypriot syllabary, were used previously. The alphabet arose from the Phoenician script and was in turn the basis of the Latin, Cyrillic, Armenian, Coptic, Gothic and many other writing systems. (Source: Wikipedia)


For most of the following operations, you must first import the CLTK Greek linguistic data (named greek_models_cltk).


The Greek vowels and consonants in upper and lower case are placed in cltk/corpus/greek/

Greek vowels can occur without any breathing or accent, or can carry rough or smooth breathing, different accents, diaereses, macrons, breves and combinations thereof. Greek consonants have none of these features, except ρ, which can have rough or smooth breathing.

The vowels and consonants are grouped by upper or lower case, accent, breathing, diaeresis and possible combinations thereof. These groupings are stored in lists or, in the case of a single letter like ρ, as strings with descriptive names structured like CASE_SPECIFIERS, e.g. LOWER_DIARESIS_CIRCUMFLEX.

For example, to use upper case vowels with rough breathing and an acute accent:

In[1]: from cltk.corpus.greek.alphabet import UPPER_ROUGH_ACUTE

In[2]: UPPER_ROUGH_ACUTE
Out[2]: ['Ἅ', 'Ἕ', 'Ἥ', 'Ἵ', 'Ὅ', 'Ὕ', 'Ὥ', 'ᾍ', 'ᾝ', 'ᾭ']

Accents indicate the pitch of vowels. An acute accent or ὀξεῖα (oxeîa) indicates a rising pitch on a long vowel or a high pitch on a short vowel, a grave accent or βαρεῖα (bareîa) indicates a normal or low pitch and a circumflex or περισπωμένη (perispōménē) indicates high or falling pitch within one syllable.

Breathings, which are used not only on vowels but also on ρ, indicate the presence or absence of a voiceless glottal fricative: rough breathing indicates a voiceless glottal fricative before a vowel, as in αἵρεσις (haíresis), and smooth breathing indicates none.

Diaereses are placed on ι and υ to indicate that two vowels do not form a diphthong. Macrons and breves are placed on α, ι, and υ to indicate the length of these vowels.

For more information on Greek diacritics see the corresponding Wikipedia page.
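
As a rough illustration of how these diacritics are layered on a letter, the sketch below uses Python's standard unicodedata module (not the CLTK) to decompose a precomposed character into its base letter and its combining marks. The function name describe_diacritics is made up for this example.

```python
import unicodedata


def describe_diacritics(char):
    """Split a precomposed character into its base letter and mark names."""
    # NFD puts the base letter first and the combining marks after it.
    decomposed = unicodedata.normalize("NFD", char)
    base = decomposed[0]
    marks = [unicodedata.name(mark) for mark in decomposed[1:]]
    return base, marks


# U+1F0D is capital alpha with rough breathing and an acute accent.
base, marks = describe_diacritics("\u1f0d")
print(base, marks)
```

The rough breathing shows up as COMBINING REVERSED COMMA ABOVE and the acute as COMBINING ACUTE ACCENT, which is how Unicode names these marks.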

Accentuation and diacritics

James Tauber has created a Python 3 based library to enable working with the accentuation of Ancient Greek words. Installing it is optional for working with CLTK.

For further information please see the original docs, as this is just an abridged version.

The library can be installed with pip:

pip install greek-accentuation

Contrary to the original docs, to use the functions from this module you must explicitly import every function you need.

The Characters Module:

base returns a given character without diacritics. For example:

In[1]: from greek_accentuation.characters import base

In[2]: base('ᾳ')
Out[2]: 'α'

add_diacritic and add_breathing add diacritics (accents, diaereses, macrons, breves) and breathing symbols to the given character. add_diacritic is stackable, for example:

In[1]: from greek_accentuation.characters import add_diacritic, ROUGH, ACUTE

In[2]: add_diacritic(add_diacritic('ο', ROUGH), ACUTE)
Out[2]: 'ὅ'

accent and strip_accents return the accent of a character as a Unicode escape and the character stripped of its accent respectively. breathing, strip_breathing, length and strip_length work analogously, for example:

In[1]: from greek_accentuation.characters import length, strip_length, SHORT

In[2]: length('ῠ') == SHORT
Out[2]: True

In[3]: strip_length('ῡ')
Out[3]: 'υ'

If a length diacritic becomes redundant because of a circumflex it can be stripped with remove_redundant_macron just like strip_length above.
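
To make the idea of stacking and stripping diacritics concrete, here is a minimal standard-library sketch. It is not how greek-accentuation is implemented: add_mark and strip_marks are hypothetical helpers, and the ROUGH and ACUTE constants below are local stand-ins holding the Unicode combining characters.

```python
import unicodedata

ROUGH = "\u0314"  # combining reversed comma above (rough breathing)
ACUTE = "\u0301"  # combining acute accent


def add_mark(char, mark):
    """Append a combining mark and recompose where Unicode allows it."""
    return unicodedata.normalize("NFC", char + mark)


def strip_marks(char):
    """Drop every combining mark, leaving only the bare letter."""
    decomposed = unicodedata.normalize("NFD", char)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


stacked = add_mark(add_mark("\u03bf", ROUGH), ACUTE)  # omicron + rough + acute
bare = strip_marks("\u1fb3")  # alpha with iota subscript
print(stacked, bare)
```

Because each call recomposes its result, calls can be nested in the same way as the add_diacritic example above.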

The Syllabify Module:

syllabify splits the given word into syllables, which are returned as a list of strings. Words without vowels are syllabified as a single syllable. The syllabification can also be displayed as a word with the syllables separated by periods with display_word.

In[1]: from greek_accentuation.syllabify import syllabify, display_word

In[2]: syllabify('γυναικός')
Out[2]: ['γυ', 'ναι', 'κός']

In[3]: syllabify('γγγ')
Out[3]: ['γγγ']

In[4]: display_word(syllabify('καταλλάσσω'))
Out[4]: 'κα.ταλ.λάσ.σω'

is_vowel and is_diphthong return a boolean value to determine whether a given character is a vowel or two given characters are a diphthong.

In[1]: from greek_accentuation.syllabify import is_diphthong

In[2]: is_diphthong('αι')
Out[2]: True

ultima, penult and antepenult return the ultima, penult or antepenult (i.e. the last, next-to-last or third-from-last syllable) of the given word. A syllable can also be further broken down into its onset, nucleus and coda (i.e. the starting consonant, middle part and ending consonant) with the functions named accordingly. rime returns the sequence of a syllable’s nucleus and coda and body returns the sequence of a syllable’s onset and nucleus.

onset_nucleus_coda returns a syllable’s onset, nucleus and coda all at once as a triple.

In[1]: from greek_accentuation.syllabify import ultima, rime, onset_nucleus_coda

In[2]: ultima('γυναικός')
Out[2]: 'κός'

In[3]: rime('κός')
Out[3]: 'ός'

In[4]: onset_nucleus_coda('ναι')
Out[4]: ('ν', 'αι', '')
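
The onset, nucleus and coda split can be pictured with a small standard-library sketch. It is only an approximation for illustration, not the library's code: the helper names are hypothetical, the input is assumed to be precomposed (one code point per letter), and returning the whole string as the onset for a vowelless syllable is this sketch's own choice.

```python
import unicodedata

VOWELS = set("\u03b1\u03b5\u03b7\u03b9\u03bf\u03c5\u03c9")  # α ε η ι ο υ ω


def is_vowel_sketch(char):
    """Check the base letter of a (possibly accented) character."""
    base = unicodedata.normalize("NFD", char)[0].lower()
    return base in VOWELS


def onset_nucleus_coda_sketch(syllable):
    """Split a syllable at its first and last vowel."""
    flags = [is_vowel_sketch(c) for c in syllable]
    if True not in flags:
        return syllable, "", ""
    first = flags.index(True)
    last = len(flags) - 1 - flags[::-1].index(True)
    return syllable[:first], syllable[first:last + 1], syllable[last + 1:]


print(onset_nucleus_coda_sketch("\u03bd\u03b1\u03b9"))  # the syllable ναι
print(onset_nucleus_coda_sketch("\u03ba\u03cc\u03c2"))  # the syllable κός
```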

debreath returns a word with the smooth breathing removed and the rough breathing replaced with an h. rebreath reverses debreath.

In[1]: from greek_accentuation.syllabify import debreath, rebreath

In[2]: debreath('οἰκία')
Out[2]: 'οικία'

In[3]: rebreath('οικία')
Out[3]: 'οἰκία'

In[4]: debreath('ἑξεῖ')
Out[4]: 'hεξεῖ'

In[5]: rebreath('hεξεῖ')
Out[5]: 'ἑξεῖ'
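
A minimal sketch of the debreath idea, using only the standard library, might look as follows. It assumes that any rough breathing sits at the start of the word (so a word-internal ῥ would be handled incorrectly), and debreath_sketch is a hypothetical name, not the library function.

```python
import unicodedata

SMOOTH = "\u0313"  # combining comma above (smooth breathing)
ROUGH = "\u0314"   # combining reversed comma above (rough breathing)


def debreath_sketch(word):
    """Drop breathing marks; signal a rough breathing with a leading 'h'."""
    decomposed = unicodedata.normalize("NFD", word)
    prefix = "h" if ROUGH in decomposed else ""
    stripped = decomposed.replace(SMOOTH, "").replace(ROUGH, "")
    return prefix + unicodedata.normalize("NFC", stripped)


rough = debreath_sketch("\u1f11\u03be\u03b5\u1fd6")         # the word ἑξεῖ
smooth = debreath_sketch("\u03bf\u1f30\u03ba\u03af\u03b1")  # the word οἰκία
print(rough, smooth)
```

Note that the result is NFC-normalized, so an acute accent comes back in its tonos encoding.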

syllable_length returns the length of a syllable (in the linguistic sense) and syllable_accent extracts a syllable’s accent.

In[1]: from greek_accentuation.characters import LONG

In[2]: from greek_accentuation.syllabify import syllable_length, syllable_accent

In[3]: syllable_length('σω') == LONG
Out[3]: True

In[4]: syllable_accent('ναι') is None
Out[4]: True

The accentuation class of a word such as oxytone, paroxytone, proparoxytone, perispomenon, properispomenon or barytone can be tested with the functions named accordingly.

add_necessary_breathing adds smooth breathing to a word if necessary.

In[1]: from greek_accentuation.syllabify import add_necessary_breathing

In[2]: add_necessary_breathing('οι')
Out[2]: 'οἰ'

In[3]: add_necessary_breathing('οἰ')
Out[3]: 'οἰ'

The Accentuation Module:

get_accent_type returns the accent type of a word as a tuple of the syllable number and accent, which is comparable to the constants provided. The accent type can also be displayed as a string with display_accent_type.

In[1]: from greek_accentuation.accentuation import get_accent_type, display_accent_type, PERISPOMENON

In[2]: get_accent_type('ἀγαθοῦ') == PERISPOMENON
Out[2]: True

In[3]: display_accent_type(get_accent_type('ψυχή'))
Out[3]: 'oxytone'
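
The notion of an accent type as a pair of syllable position and accent can be sketched without the library. The code below is an illustration under simplifying assumptions: it takes an already syllabified word, counts positions from the end of the word, ignores the grave accent, and uses hypothetical names throughout.

```python
import unicodedata

ACUTE = "\u0301"       # combining acute accent
CIRCUMFLEX = "\u0342"  # combining Greek perispomeni

ACCENT_TYPE_NAMES = {
    (1, ACUTE): "oxytone",
    (2, ACUTE): "paroxytone",
    (3, ACUTE): "proparoxytone",
    (1, CIRCUMFLEX): "perispomenon",
    (2, CIRCUMFLEX): "properispomenon",
}


def accent_type_sketch(syllables):
    """Return (position from the end, accent) for the accented syllable."""
    for position, syllable in enumerate(reversed(syllables), start=1):
        decomposed = unicodedata.normalize("NFD", syllable)
        for accent in (ACUTE, CIRCUMFLEX):
            if accent in decomposed:
                return position, accent
    return None


psyche = ["\u03c8\u03c5", "\u03c7\u03ae"]                # ψυ, χή
agathou = ["\u1f00", "\u03b3\u03b1", "\u03b8\u03bf\u1fe6"]  # ἀ, γα, θοῦ
print(ACCENT_TYPE_NAMES[accent_type_sketch(psyche)])
print(ACCENT_TYPE_NAMES[accent_type_sketch(agathou)])
```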

syllable_add_accent(syllable, accent) adds the given accent to a syllable. It is also possible to apply an accent class to a whole word, for example:

In[1]: from greek_accentuation.characters import CIRCUMFLEX

In[2]: from greek_accentuation.accentuation import syllable_add_accent, make_paroxytone

In[3]: syllable_add_accent('ου', CIRCUMFLEX)
Out[3]: 'οῦ'

In[4]: make_paroxytone('λογος')
Out[4]: 'λόγος'

possible_accentuations returns all possible accentuations of a given syllabification according to Ancient Greek accentuation rules. To treat vowels of unmarked length as short vowels set default_short = True in the function parameters.

In[1]: from greek_accentuation.syllabify import syllabify

In[2]: from greek_accentuation.accentuation import possible_accentuations, add_accent

In[3]: s = syllabify('εγινωσκου')

In[4]: for accent_class in possible_accentuations(s):
  ...:     print(add_accent(s, accent_class))
εγινώσκου
εγινωσκού
εγινωσκοῦ

In[5]: s = syllabify('κυριος')

In[6]: for accent_class in possible_accentuations(s, default_short=True):
  ...:     print(add_accent(s, accent_class))
κύριος
κυρίος
κυριός

recessive finds the most recessive (i.e. as far away from the end of the word as possible) accent and returns the given word with that accent. A | can be placed in the word to set a point past which the accent will not recede. on_penult places the accent on the penult (next-to-last syllable).

In[1]: from greek_accentuation.accentuation import recessive, on_penult

In[2]: recessive('εἰσηλθον')
Out[2]: 'εἴσηλθον'

In[3]: recessive('εἰσ|ηλθον')
Out[3]: 'εἰσῆλθον'

In[4]: on_penult('φωνησαι')
Out[4]: 'φωνῆσαι'

persistent gets passed a word and a lemma (i.e. the canonical form of a set of words) and derives the accent from these two words.

In[1]: from greek_accentuation.accentuation import persistent

In[2]: persistent('ἀνθρωπου', 'ἄνθρωπος')
Out[2]: 'ἀνθρώπου'

Expand iota subscript:

The CLTK offers one transformation that can be useful in certain types of processing: expanding the iota subscript from a Unicode code point and placing it beside, to the right of, the character.

In [1]: from cltk.corpus.greek.alphabet import expand_iota_subscript

In [2]: s = 'εἰ δὲ καὶ τῷ ἡγεμόνι πιστεύσομεν ὃν ἂν Κῦρος διδῷ'

In [3]: expand_iota_subscript(s)
Out[3]: 'εἰ δὲ καὶ τῶΙ ἡγεμόνι πιστεύσομεν ὃν ἂν Κῦρος διδῶΙ'

In [4]: expand_iota_subscript(s, lowercase=True)
Out[4]: 'εἰ δὲ καὶ τῶι ἡγεμόνι πιστεύσομεν ὃν ἂν κῦρος διδῶι'
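
The transformation can be approximated with the standard library alone, which may help when reasoning about what happens to each character. This is a sketch, not the CLTK implementation: expand_iota_subscript_sketch is a hypothetical name, and lowercasing the whole string when lowercase=True simply mirrors the output shown above.

```python
import unicodedata

YPOGEGRAMMENI = "\u0345"  # combining iota subscript


def expand_iota_subscript_sketch(text, lowercase=False):
    """Replace each iota subscript with a full iota after its vowel."""
    iota = "\u03b9" if lowercase else "\u0399"
    pieces = []
    for char in text:
        decomposed = unicodedata.normalize("NFD", char)
        if YPOGEGRAMMENI in decomposed:
            vowel = decomposed.replace(YPOGEGRAMMENI, "")
            pieces.append(unicodedata.normalize("NFC", vowel) + iota)
        else:
            pieces.append(char)
    result = "".join(pieces)
    return result.lower() if lowercase else result


word = "\u03c4\u1ff7"  # the word τῷ
print(expand_iota_subscript_sketch(word))
print(expand_iota_subscript_sketch(word, lowercase=True))
```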

Converting Beta Code to Unicode

Note that incoming strings need to begin with an r and that the Beta Code must follow immediately after the initial """, as in input line 2, below.

In [1]: from cltk.corpus.greek.beta_to_unicode import Replacer


In [3]: r = Replacer()

In [4]: r.beta_code(BETA_EXAMPLE)
Out[4]: 'ὅπως οὖν μὴ ταὐτὸ πάθωμεν ἐκείνοις, ἐπὶ τὴν διάγνωσιν αὐτῶν ἔρχεσθαι δεῖ πρῶτον. τινὲς μὲν οὖν αὐτῶν εἰσιν ἀκριβεῖς, τινὲς δὲ οὐκ ἀκριβεῖς ὄντες μεταπίπτουσιν εἰς τοὺς ἐπὶ σήψει· οὕτω γὰρ καὶ λοῦσαι καὶ θρέψαι καλῶς καὶ μὴ λοῦσαι πάλιν, ὅτε μὴ ὀρθῶς δυνηθείημεν.'

The beta code converter can also handle lowercase notation:

In [5]: BETA_EXAMPLE_2 = r"""me/xri me\n w)/n tou/tou a(rpaga/s mou/nas ei)=nai par' a)llh/lwn, to\ de\ a)po\ tou/tou *(/ellhnas dh\ mega/lws ai)ti/ous gene/sqai: prote/rous ga\r a)/rcai strateu/esqai e)s th\n *)asi/hn h)\ sfe/as e)s th\n *eu)rw/phn. """

In [6]: r.beta_code(BETA_EXAMPLE_2)
Out[6]: 'μέχρι μὲν ὤν τούτου ἁρπαγάς μούνας εἶναι παρ’ ἀλλήλων, τὸ δὲ ἀπὸ τούτου Ἕλληνας δὴ μεγάλως αἰτίους γενέσθαι· προτέρους γὰρ ἄρξαι στρατεύεσθαι ἐς τὴν Ἀσίην ἢ σφέας ἐς τὴν Εὐρώπην.'
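
For readers unfamiliar with Beta Code, the toy converter below shows the basic principle: letters map to Greek letters, and the symbols that follow a letter map to combining diacritics. It is a deliberately incomplete sketch and not the CLTK's Replacer: beta_to_unicode_sketch is a hypothetical function, capitals marked with * are not handled, and punctuation is passed through unchanged.

```python
import re
import unicodedata

LETTERS = dict(zip("abgdezhqiklmncoprstufxyw",
                   "αβγδεζηθικλμνξοπρστυφχψω"))
MARKS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex
    "|": "\u0345",   # iota subscript
    "+": "\u0308",   # diaeresis
}


def beta_to_unicode_sketch(beta):
    """Convert simple Beta Code (no capitals) to precomposed Unicode."""
    chars = [LETTERS.get(c) or MARKS.get(c) or c for c in beta.lower()]
    text = unicodedata.normalize("NFC", "".join(chars))
    # A sigma at the end of a word takes its final form.
    return re.sub("\u03c3\\b", "\u03c2", text)


print(beta_to_unicode_sketch(r"lo/gos"))
print(beta_to_unicode_sketch(r"MH=NIN A)/EIDE"))
```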

Converting TLG texts with TLGU

TLGU is excellent C-language software for converting the TLG and PHI corpora into human-readable Unicode. The CLTK has an automated downloader and installer, as well as a wrapper which facilitates its use. When TLGU() is instantiated, it checks the local OS for a functioning version of the software. If none is found, it is downloaded and installed after the user confirms.

Most users will want to do a bulk conversion of the entirety of a corpus without any text markup (such as chapter or line numbers). Note that you must import a local corpus before converting it.

In [1]: from cltk.corpus.greek.tlgu import TLGU

In [2]: t = TLGU()

In [3]: t.convert_corpus(corpus='tlg')  # writes to: ~/cltk_data/greek/text/tlg/plaintext/

For the PHI7, you may declare whether you want the corpus to be written to the greek or latin directories. By default, it writes to greek.

In [5]: t.convert_corpus(corpus='phi7')  # ~/cltk_data/greek/text/phi7/plaintext/

In [6]: t.convert_corpus(corpus='phi7', latin=True)  # ~/cltk_data/latin/text/phi7/plaintext/

The above commands take each author file and convert them into a new author file. But the software has a useful option to divide each author file into a new file for each work it contains. Thus, Homer’s file, TLG0012.TXT, becomes TLG0012.TXT-001.txt, TLG0012.TXT-002.txt, and TLG0012.TXT-003.txt. To achieve this, use the following command for the TLG:

In [7]: t.divide_works('tlg')  # ~/cltk_data/greek/text/tlg/individual_works/

You may also convert individual files, with options for how the conversion happens.

In [3]: t.convert('~/Downloads/corpora/TLG_E/TLG0003.TXT', '~/Documents/thucydides.txt')

In [4]: t.convert('~/Downloads/corpora/TLG_E/TLG0003.TXT', '~/Documents/thucydides.txt', markup='full')

In [5]: t.convert('~/Downloads/corpora/TLG_E/TLG0003.TXT', '~/Documents/thucydides.txt', rm_newlines=True)

In [6]: t.convert('~/Downloads/corpora/TLG_E/TLG0003.TXT', '~/Documents/thucydides.txt', divide_works=True)

For convert(), plain arguments may be sent directly to the TLGU, as well, via extra_args:

In [7]: t.convert('~/Downloads/corpora/TLG_E/TLG0003.TXT', '~/Documents/thucydides.txt', extra_args=['p', 'B'])

You may read about these arguments in the TLGU manual.

Even after plaintext conversion, the TLG will still need some cleanup. The CLTK contains some code for post-TLGU cleanup.

Once these files are created, see TLG Indices below for accessing these newly created files.

Corpus Readers

Most users will want to access words, sentences, paragraphs and even whole documents via a CorpusReader object. All corpus contributors should provide a suitable reader. There is one for Perseus Greek, and others will be made available. The CorpusReader methods are: paras() returns paragraphs, if possible; words() returns a generator of words; sents() returns a generator of sentences; docs() returns a generator of Python dictionary objects representing each document.

In [1]: from cltk.corpus.readers import get_corpus_reader
   ...: reader = get_corpus_reader(corpus_name='greek_text_perseus', language='greek')
   ...: # get all the docs
   ...: docs = list(reader.docs())
   ...: len(docs)
Out[1]: 222

In [2]: # or set just one
   ...: reader._fileids = ['plato__apology__grc.json']

In [3]: # get all the sentences
In [4]: sentences = list(reader.sents())
   ...: len(sentences)
Out[4]: 4983

In [5]: # Or just one

In [6]: sentences[0]
Out[6]: '\n \n \n \n \n ὅτι μὲν ὑμεῖς, ὦ ἄνδρες Ἀθηναῖοι, πεπόνθατε ὑπὸ\n τῶν ἐμῶν κατηγόρων, οὐκ οἶδα· ἐγὼ δʼ οὖν καὶ αὐτὸς ὑπʼ αὐτῶν ὀλίγου ἐμαυτοῦ\n ἐπελαθόμην, οὕτω πιθανῶς ἔλεγον.'

In [7]: # access an individual doc as a dictionary of dictionaries
   ...: doc = list(reader.docs())[0]
   ...: doc.keys()
Out[7]: dict_keys(['language', 'englishTitle', 'original-urn', 'author', 'urn', 'text', 'source', 'originalTitle', 'edition', 'sourceLink', 'meta', 'filename'])

Information Retrieval

See Multilingual Information Retrieval for Greek–specific search options.



Lemmatization

For ambiguous forms, which could belong to several headwords, the current lemmatizer chooses the more commonly occurring headword (code here). For any errors that you spot, please open a ticket.

The CLTK’s lemmatizer is based on a key-value store, whose code is available at the CLTK’s Latin lemma/POS repository.

The lemmatizer offers several input and output options. For text input, it can take a string or a list of tokens. Here is an example of the lemmatizer taking a string:

In [1]: from cltk.stem.lemma import LemmaReplacer

In [2]: sentence = 'τὰ γὰρ πρὸ αὐτῶν καὶ τὰ ἔτι παλαίτερα σαφῶς μὲν εὑρεῖν διὰ χρόνου πλῆθος ἀδύνατα ἦν'

In [3]: from cltk.corpus.utils.formatter import cltk_normalize

In [4]: sentence = cltk_normalize(sentence)  # can help when using certain texts

In [5]: lemmatizer = LemmaReplacer('greek')

In [6]: lemmatizer.lemmatize(sentence)

And here taking a list:

In [5]: lemmatizer.lemmatize(['χρόνου', 'πλῆθος', 'ἀδύνατα', 'ἦν'])
Out[5]: ['χρόνος', 'πλῆθος', 'ἀδύνατος', 'εἰμί']

The lemmatizer takes several optional arguments for controlling output: return_raw=True and return_string=True. return_raw returns the original inflection along with its headword:

In [6]: lemmatizer.lemmatize(['χρόνου', 'πλῆθος', 'ἀδύνατα', 'ἦν'], return_raw=True)
Out[6]: ['χρόνου/χρόνος', 'πλῆθος/πλῆθος', 'ἀδύνατα/ἀδύνατος', 'ἦν/εἰμί']

And return_string wraps the list in ' '.join():

In [7]: lemmatizer.lemmatize(['χρόνου', 'πλῆθος', 'ἀδύνατα', 'ἦν'], return_string=True)
Out[7]: 'χρόνος πλῆθος ἀδύνατος εἰμί'

These two arguments can be combined, as well.

Lemmatization, backoff method

The CLTK offers a series of lemmatizers that can be combined in a backoff chain, i.e. if one lemmatizer is unable to return a headword for a token, this token can be passed onto another lemmatizer until either a headword is returned or the sequence ends.

There is a generic version of the backoff Greek lemmatizer which requires the CLTK Greek models data (greek_models_cltk). The lemmatizer expects this model to be stored in a folder called cltk_data in the user’s home directory.

To use the generic version of the backoff Greek Lemmatizer:

In [1]: from cltk.lemmatize.greek.backoff import BackoffGreekLemmatizer

In [2]: lemmatizer = BackoffGreekLemmatizer()

In [3]: tokens = 'κατέβην χθὲς εἰς Πειραιᾶ μετὰ Γλαύκωνος τοῦ Ἀρίστωνος'.split()

In [4]: lemmatizer.lemmatize(tokens)
Out[4]: [('κατέβην', 'καταβαίνω'), ('χθὲς', 'χθές'), ('εἰς', 'εἰς'), ('Πειραιᾶ', 'Πειραιᾶ'), ('μετὰ', 'μετά'), ('Γλαύκωνος', 'Γλαύκων'), ('τοῦ', 'ὁ'), ('Ἀρίστωνος', 'Ἀρίστων')]

NB: The backoff chain for this lemmatizer is defined as follows: 1. a dictionary-based lemmatizer with high-frequency, unambiguous forms; 2. a training-data-based lemmatizer based on sentences from the Perseus Latin Dependency Treebanks; 3. a regular-expression-based lemmatizer transforming unambiguous endings (currently very limited); 4. a dictionary-based lemmatizer with the complete set of Morpheus lemmas; 5. an ‘identity’ lemmatizer returning the token as the lemma. Each of these sub-lemmatizers is explained in the documents for “Multilingual”.
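
The backoff principle itself is simple enough to sketch in a few lines of plain Python. Everything below is a toy stand-in: the dictionary, the single ending rule and the function names are invented for illustration and have nothing to do with the CLTK's actual models or classes.

```python
import unicodedata

# Toy data, assumed for this example only.
HIGH_FREQUENCY = {unicodedata.normalize("NFC", "τοῦ"): "ὁ"}
ENDING_RULES = [("ου", "ος")]  # genitive singular to nominative singular


def dictionary_lemmatizer(token):
    return HIGH_FREQUENCY.get(unicodedata.normalize("NFC", token))


def ending_lemmatizer(token):
    for ending, replacement in ENDING_RULES:
        if token.endswith(ending):
            return token[:-len(ending)] + replacement
    return None


def identity_lemmatizer(token):
    return token  # last resort: the token is its own lemma


CHAIN = [dictionary_lemmatizer, ending_lemmatizer, identity_lemmatizer]


def backoff_lemmatize(tokens):
    """Try each lemmatizer in order and keep the first answer."""
    results = []
    for token in tokens:
        for lemmatizer in CHAIN:
            lemma = lemmatizer(token)
            if lemma is not None:
                results.append((token, lemma))
                break
    return results


tokens = ["τοῦ", "χρόνου", "πλῆθος"]
result = backoff_lemmatize(tokens)
print(result)
```

The order of the chain matters: putting the identity lemmatizer last guarantees that every token receives some lemma, while more specific lemmatizers get the first chance to answer.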

Named Entity Recognition

A simple interface to a list of Greek proper nouns is available (see repo for how the list was created). By default tag_ner() takes a string input and returns a list of tuples. However, it can also take pre-tokenized forms and return a string.

In [1]: from cltk.tag import ner

In [2]: text_str = 'τὰ Σίλαριν Σιννᾶν Κάππαρος Πρωτογενείας Διονυσιάδες τὴν'

In [3]: ner.tag_ner('greek', input_text=text_str, output_type=list)
Out[3]:
[('τὰ',),
 ('Σίλαριν', 'Entity'),
 ('Σιννᾶν', 'Entity'),
 ('Κάππαρος', 'Entity'),
 ('Πρωτογενείας', 'Entity'),
 ('Διονυσιάδες', 'Entity'),
 ('τὴν',)]


Normalization

Normalizing polytonic Greek is a problem that has been mostly solved; however, issues still arise when working with legacy applications. We recommend normalizing Greek vowels in order to ensure string matching.

One type of normalization issue comes from tonos accents (intended for Modern Greek) being used instead of the oxia accents (for Ancient Greek). Here is an example of two characters appearing identical but being in fact dissimilar:

In [1]: from cltk.corpus.utils.formatter import tonos_oxia_converter

In [2]: char_tonos = "ά"  # with tonos, for Modern Greek

In [3]: char_oxia = "ά"  # with oxia, for Ancient Greek

In [4]: char_tonos == char_oxia
Out[4]: False

In [5]: ord(char_tonos)
Out[5]: 940

In [6]: ord(char_oxia)
Out[6]: 8049

In [7]: char_oxia == tonos_oxia_converter(char_tonos)
Out[7]: True

If for any reason you want to go from oxia to tonos, just add the reverse=True parameter:

In [8]: char_tonos == tonos_oxia_converter(char_oxia, reverse=True)
Out[8]: True

Another approach to normalization is to use normalize() from Python’s built-in unicodedata module. The CLTK provides a wrapper for this, as a convenience. Here’s an example of its use in “compatibility” mode (NFKC):

In [1]: from cltk.corpus.utils.formatter import cltk_normalize

In [2]: tonos = "ά"

In [3]: oxia = "ά"

In [4]: tonos == oxia
Out[4]: False

In [5]: tonos == cltk_normalize(oxia)
Out[5]: True

One can turn off compatibility with:

In [6]: tonos == cltk_normalize(oxia, compatibility=False)
Out[6]: True

For more on normalize() see the Python Unicode docs.
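
The tonos and oxia behaviour can also be observed directly with the standard library, without the CLTK wrapper. The short sketch below shows that the two code points differ but collapse to the same character under NFC, because Unicode defines the oxia form as canonically equivalent to the tonos form.

```python
import unicodedata

char_tonos = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS
char_oxia = "\u1f71"   # GREEK SMALL LETTER ALPHA WITH OXIA

same_before = char_tonos == char_oxia
same_after = unicodedata.normalize("NFC", char_oxia) == char_tonos
names = (unicodedata.name(char_tonos), unicodedata.name(char_oxia))

print(same_before, same_after)
print(names)
```

Note that standard normalization goes from oxia to tonos, which is the opposite direction to tonos_oxia_converter() without reverse=True.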

POS tagging

These taggers were built with the assistance of the NLTK. The backoff tagger is Bayesian and the TnT is HMM. To obtain the models, first import the greek_models_cltk corpus.

1–2–3–gram backoff tagger

In [1]: from cltk.tag.pos import POSTag

In [2]: tagger = POSTag('greek')

In [3]: tagger.tag_ngram_123_backoff('θεοὺς μὲν αἰτῶ τῶνδ᾽ ἀπαλλαγὴν πόνων φρουρᾶς ἐτείας μῆκος')
[('θεοὺς', 'N-P---MA-'),
 ('μὲν', 'G--------'),
 ('αἰτῶ', 'V1SPIA---'),
 ('τῶνδ', 'P-P---MG-'),
 ('᾽', None),
 ('ἀπαλλαγὴν', 'N-S---FA-'),
 ('πόνων', 'N-P---MG-'),
 ('φρουρᾶς', 'N-S---FG-'),
 ('ἐτείας', 'A-S---FG-'),
 ('μῆκος', 'N-S---NA-')]

TnT tagger

In [4]: tagger.tag_tnt('θεοὺς μὲν αἰτῶ τῶνδ᾽ ἀπαλλαγὴν πόνων φρουρᾶς ἐτείας μῆκος')
[('θεοὺς', 'N-P---MA-'),
 ('μὲν', 'G--------'),
 ('αἰτῶ', 'V1SPIA---'),
 ('τῶνδ', 'P-P---NG-'),
 ('᾽', 'Unk'),
 ('ἀπαλλαγὴν', 'N-S---FA-'),
 ('πόνων', 'N-P---MG-'),
 ('φρουρᾶς', 'N-S---FG-'),
 ('ἐτείας', 'A-S---FG-'),
 ('μῆκος', 'N-S---NA-')]

CRF tagger


This tagger’s accuracy has not yet been tested.

We use the NLTK’s CRF tagger. For information on it, see the NLTK docs.

In [5]: tagger.tag_crf('θεοὺς μὲν αἰτῶ τῶνδ᾽ ἀπαλλαγὴν πόνων φρουρᾶς ἐτείας μῆκος')
[('θεοὺς', 'N-P---MA-'),
 ('μὲν', 'G--------'),
 ('αἰτῶ', 'V1SPIA---'),
 ('τῶνδ', 'P-P---NG-'),
 ('᾽', 'A-S---FA-'),
 ('ἀπαλλαγὴν', 'N-S---FA-'),
 ('πόνων', 'N-P---MG-'),
 ('φρουρᾶς', 'A-S---FG-'),
 ('ἐτείας', 'N-S---FG-'),
 ('μῆκος', 'N-S---NA-')]
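
The nine-character tags returned by the taggers above are positional. The sketch below unpacks a tag into named fields; the field order (part of speech, person, number, tense, mood, voice, gender, case, degree) is an assumption based on the Perseus treebank tag layout, and decode_tag is a hypothetical helper, not part of the CLTK.

```python
# Assumed field order of the nine-position tag.
TAG_FIELDS = ("pos", "person", "number", "tense", "mood",
              "voice", "gender", "case", "degree")


def decode_tag(tag):
    """Map each filled position of a tag to its field name."""
    if not tag or len(tag) != len(TAG_FIELDS):
        return {}  # e.g. None or 'Unk' for tokens the tagger could not label
    return {field: value for field, value in zip(TAG_FIELDS, tag) if value != "-"}


decoded_verb = decode_tag("V1SPIA---")
decoded_noun = decode_tag("N-P---MA-")
print(decoded_verb)
print(decoded_noun)
```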

Prosody Scanning

There is a prosody scanner for scanning rhythms in Greek texts. It returns a list of strings of long and short marks for each sentence. Note that the last syllable of each sentence string is marked with an anceps so that specific clausulae are delineated.

In [1]: from cltk.prosody.greek.scanner import Scansion

In [2]: scanner = Scansion()

In [3]: scanner.scan_text('νέος μὲν καὶ ἄπειρος, δικῶν ἔγωγε ἔτι. μὲν καὶ ἄπειρος.')
Out[3]: ['˘¯¯¯˘¯¯˘¯˘¯˘˘x', '¯¯˘¯x']

Sentence Tokenization

Sentence tokenization for Ancient Greek is available using (by default) a regular-expression based tokenizer. To tokenize a Greek text by sentences…

In [1]: from cltk.tokenize.greek.sentence import SentenceTokenizer

In [2]: sent_tokenizer = SentenceTokenizer()

In [3]: untokenized_text = """ὅλως δ’ ἀντεχόμενοί τινες, ὡς οἴονται, δικαίου τινός (ὁ γὰρ νόμος δίκαιόν τἰ τὴν κατὰ πόλεμον δουλείαν τιθέασι δικαίαν, ἅμα δ’ οὔ φασιν· τήν τε γὰρ ἀρχὴν ἐνδέχεται μὴ δικαίαν εἶναι τῶν πολέμων, καὶ τὸν ἀνάξιον δουλεύειν οὐδαμῶς ἂν φαίη τις δοῦλον εἶναι· εἰ δὲ μή, συμβήσεται τοὺς εὐγενεστάτους εἶναι δοκοῦντας δούλους εἶναι καὶ ἐκ δούλων, ἐὰν συμβῇ πραθῆναι ληφθέντας."""

In [4]: sent_tokenizer.tokenize(untokenized_text)
Out[4]: ['ὅλως δ’ ἀντεχόμενοί τινες, ὡς οἴονται, δικαίου τινός (ὁ γὰρ νόμος δίκαιόν τἰ τὴν κατὰ πόλεμον δουλείαν τιθέασι δικαίαν, ἅμα δ’ οὔ φασιν·', 'τήν τε γὰρ ἀρχὴν ἐνδέχεται μὴ δικαίαν εἶναι τῶν πολέμων, καὶ τὸν ἀνάξιον δουλεύειν οὐδαμῶς ἂν φαίη τις δοῦλον εἶναι·', 'εἰ δὲ μή, συμβήσεται τοὺς εὐγενεστάτους εἶναι δοκοῦντας δούλους εἶναι καὶ ἐκ δούλων, ἐὰν συμβῇ πραθῆναι ληφθέντας.']

The sentence tokenizer takes a string input into tokenize() and returns a list of strings. For more on the tokenizer, or to make your own, see the CLTK’s Greek sentence tokenizer training set repository.
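
To give a feel for what a regular-expression based sentence splitter does, here is a small standalone sketch. The pattern is an assumption chosen to reproduce the kind of splits shown above (after a full stop, a Greek question mark or an ano teleia); it is not the expression the CLTK actually uses, and split_sentences_sketch is a hypothetical name.

```python
import re

# Split on whitespace that follows '.', ';' (also U+037E) or the ano teleia
# in either of its encodings (U+00B7, U+0387).
SENTENCE_END = re.compile("(?<=[.;\u037e\u00b7\u0387])\\s+")


def split_sentences_sketch(text):
    """Return the sentences of a text, keeping their final punctuation."""
    return [s for s in SENTENCE_END.split(text.strip()) if s]


sample = "πρῶτον\u00b7 δεύτερον. τρίτον;"
sentences = split_sentences_sketch(sample)
print(sentences)
```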

There is also an experimental Punkt tokenizer trained on the Greek Tesserae texts. The model for this tokenizer can be found in the CLTK corpora under greek_models_cltk/tokenizers/sentence/greek_punkt.

In [5]: from cltk.tokenize.greek.sentence import SentenceTokenizer

In [6]: sent_tokenizer = SentenceTokenizer(tokenizer='punkt')


NB: The old method for sentence tokenization, i.e. TokenizeSentence, is still available, but will soon be replaced by the method above.

In [7]: from cltk.tokenize.sentence import TokenizeSentence

In [8]: tokenizer = TokenizeSentence('greek')


Stopword Filtering

To use the CLTK’s built-in stopwords list:

In [1]: from nltk.tokenize.punkt import PunktLanguageVars

In [2]: from cltk.stop.greek.stops import STOPS_LIST

In [3]: sentence = 'Ἅρπαγος δὲ καταστρεψάμενος Ἰωνίην ἐποιέετο στρατηίην ἐπὶ Κᾶρας καὶ Καυνίους καὶ Λυκίους, ἅμα ἀγόμενος καὶ Ἴωνας καὶ Αἰολέας.'

In [4]: p = PunktLanguageVars()

In [5]: tokens = p.word_tokenize(sentence.lower())

In [6]: [w for w in tokens if w not in STOPS_LIST]


Swadesh

The corpus module has a class for generating a Swadesh list for Greek.

In [1]: from cltk.corpus.swadesh import Swadesh

In [2]: swadesh = Swadesh('gr')

In [3]: swadesh.words()[:10]
Out[3]: ['ἐγώ', 'σύ', 'αὐτός, οὗ, ὅς, ὁ, οὗτος', 'ἡμεῖς', 'ὑμεῖς', 'αὐτοί', 'ὅδε', 'ἐκεῖνος', 'ἔνθα, ἐνθάδε, ἐνταῦθα', 'ἐκεῖ']


TEI XML

There are two rudimentary corpus converters for the “First 1K Years of Greek” project (download the corpus 'greek_text_first1kgreek'). Both write files to ~/cltk_data/greek/text/greek_text_first1kgreek_plaintext.

The first is built upon the MyCapytain library (pip install lxml MyCapytain), which is capable of very precise chunking of TEI XML. The following function only preserves numbers:

In [1]: from cltk.corpus.greek.tei import onekgreek_tei_xml_to_text_capitains

In [2]: onekgreek_tei_xml_to_text_capitains()

For the following, install the BeautifulSoup library (pip install bs4). Note that this will just dump all text not contained within a node’s bracket (including sometimes metadata).

In [1]: from cltk.corpus.greek.tei import onekgreek_tei_xml_to_text

In [2]: onekgreek_tei_xml_to_text()

Text Cleanup

Intended for use on the TLG after processing by TLGU().

In [1]: from cltk.corpus.utils.formatter import tlg_plaintext_cleanup

In [2]: import os

In [3]: file = os.path.expanduser('~/cltk_data/greek/text/tlg/individual_works/TLG0035.TXT-001.txt')

In [4]: with open(file) as f:
...:     r = f.read()

In [5]: r[:500]
Out[5]: "\n{ΜΟΣΧΟΥ ΕΡΩΣ ΔΡΑΠΕΤΗΣ} \n  Ἁ Κύπρις τὸν Ἔρωτα τὸν υἱέα μακρὸν ἐβώστρει: \n‘ὅστις ἐνὶ τριόδοισι πλανώμενον εἶδεν Ἔρωτα, \nδραπετίδας ἐμός ἐστιν: ὁ μανύσας γέρας ἑξεῖ. \nμισθός τοι τὸ φίλημα τὸ Κύπριδος: ἢν δ' ἀγάγῃς νιν, \nοὐ γυμνὸν τὸ φίλημα, τὺ δ', ὦ ξένε, καὶ πλέον ἑξεῖς. \nἔστι δ' ὁ παῖς περίσαμος: ἐν εἴκοσι πᾶσι μάθοις νιν. \nχρῶτα μὲν οὐ λευκὸς πυρὶ δ' εἴκελος: ὄμματα δ' αὐτῷ \nδριμύλα καὶ φλογόεντα: κακαὶ φρένες, ἁδὺ λάλημα: \nοὐ γὰρ ἴσον νοέει καὶ φθέγγεται: ὡς μέλι φωνά, \nὡς δὲ χολὰ νόος ἐστίν: "

In [7]: tlg_plaintext_cleanup(r, rm_punctuation=True, rm_periods=False)[:500]
Out[7]: ' Ἁ Κύπρις τὸν Ἔρωτα τὸν υἱέα μακρὸν ἐβώστρει ὅστις ἐνὶ τριόδοισι πλανώμενον εἶδεν Ἔρωτα δραπετίδας ἐμός ἐστιν ὁ μανύσας γέρας ἑξεῖ. μισθός τοι τὸ φίλημα τὸ Κύπριδος ἢν δ ἀγάγῃς νιν οὐ γυμνὸν τὸ φίλημα τὺ δ ὦ ξένε καὶ πλέον ἑξεῖς. ἔστι δ ὁ παῖς περίσαμος ἐν εἴκοσι πᾶσι μάθοις νιν. χρῶτα μὲν οὐ λευκὸς πυρὶ δ εἴκελος ὄμματα δ αὐτῷ δριμύλα καὶ φλογόεντα κακαὶ φρένες ἁδὺ λάλημα οὐ γὰρ ἴσον νοέει καὶ φθέγγεται ὡς μέλι φωνά ὡς δὲ χολὰ νόος ἐστίν ἀνάμερος ἠπεροπευτάς οὐδὲν ἀλαθεύων δόλιον βρέφος ἄγρια π'

TLG Indices

The TLG comes with some old, difficult-to-parse index files which have been made available as Python dictionaries (at cltk/corpus/greek/tlg). Below are some functions to make accessing these easy. Depending on the function, the output is either a dict representing an index or, for functions that return unique author ids, a set.


Python sets are like lists, but contain only unique values. Multiple sets can be conveniently combined (see docs here).

In [1]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_female_authors

In [2]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_epithet_index

In [3]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_epithets

In [4]: from cltk.corpus.greek.tlg.parse_tlg_indices import select_authors_by_epithet

In [5]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_epithet_of_author

In [6]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_geo_index

In [7]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_geographies

In [8]: from cltk.corpus.greek.tlg.parse_tlg_indices import select_authors_by_geo

In [9]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_geo_of_author

In [10]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_lists

In [11]: from cltk.corpus.greek.tlg.parse_tlg_indices import get_id_author

In [12]: from cltk.corpus.greek.tlg.parse_tlg_indices import select_id_by_name

In [13]: get_female_authors()

In [14]: get_epithet_index()
{'Lexicographi': {'3136', '4040', '4085', '9003'},
 'Lyrici/-ae': {'0009',

In [15]: get_epithets()

In [16]: select_authors_by_epithet('Tactici')
Out[16]: {'0058', '0546', '0556', '0648', '3075', '3181'}

In [17]: get_epithet_of_author('0016')
Out[17]: 'Historici/-ae'

In [18]: get_geo_index()
{'Alchemistae': {'1016',

In [19]: get_geographies()

In [20]: select_authors_by_geo('Thmuis')
Out[20]: {'2966'}

In [21]: get_geo_of_author('0216')
Out[21]: 'Aetolia'

In [22]: get_lists()
{'Lists pertaining to all works in Canon (by TLG number)': {'LIST3CLA.BIN': 'Literary classifications of works',
  'LIST3CLX.BIN': 'Literary classifications of works (with x-refs)',
  'LIST3DAT.BIN': 'Chronological classifications of authors',

In [23]: get_id_author()
{'1139': 'Anonymi Historici (FGrH)',
 '4037': 'Anonymi Paradoxographi',
 '0616': 'Polyaenus Rhet.',

In [28]: select_id_by_name('hom')
[('0012', 'Homerus Epic., Homer'),
 ('1252', 'Certamen Homeri Et Hesiodi'),
 ('1805', 'Vitae Homeri'),
 ('5026', 'Scholia In Homerum'),
 ('1375', 'Evangelium Thomae'),
 ('2038', 'Acta Thomae'),
 ('0013', 'Hymni Homerici, Homeric Hymns'),
 ('0253', '[Homerus] [Epic.]'),
 ('1802', 'Homerica'),
 ('1220', 'Batrachomyomachia'),
 ('9023', 'Thomas Magister Philol.')]
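
Because the selection functions return sets of author ids, results can be combined with Python's ordinary set operators. The sketch below reuses the two sets shown in the outputs above and adds one made-up selection (hypothetical_selection) purely for illustration.

```python
# Values copied from the outputs of select_authors_by_epithet('Tactici')
# and select_authors_by_geo('Thmuis') shown above.
tactici = {"0058", "0546", "0556", "0648", "3075", "3181"}
thmuis = {"2966"}

# A made-up selection of author ids, assumed for this example only.
hypothetical_selection = {"0058", "2966", "0012"}

either = tactici | thmuis                                 # union
both = tactici & thmuis                                   # intersection
picked_tactici = hypothetical_selection & tactici         # ids that are Tactici
remaining = hypothetical_selection - tactici - thmuis     # ids in neither set

print(sorted(either))
print(both, picked_tactici, remaining)
```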

In addition to these indices there are several helper functions which will build filepaths for your particular computer. Note that you will need to have run convert_corpus(corpus='tlg') and divide_works('tlg') from the TLGU() class, respectively, for the following two functions.

In [1]: from cltk.corpus.utils.formatter import assemble_tlg_author_filepaths

In [2]: assemble_tlg_author_filepaths()

In [3]: from cltk.corpus.utils.formatter import assemble_tlg_works_filepaths

In [4]: assemble_tlg_works_filepaths()

These two functions are useful when, for example, needing to process all authors of the TLG corpus, all works of the corpus, or all works of one particular author.


Transliteration

The CLTK provides IPA phonetic transliteration for the Greek language. Currently, the only available dialect is Attic as reconstructed by Philomen Probert (taken from A Companion to the Ancient Greek Language, 85-103). Example:

In [1]: from cltk.phonology.greek.transcription import Transcriber

In [2]: transcriber = Transcriber(dialect="Attic", reconstruction="Probert")

In [3]: transcriber.transcribe("Διόθεν καὶ δισκήπτρου τιμῆς ὀχυρὸν ζεῦγος Ἀτρειδᾶν στόλον Ἀργείων")
Out[3]: '[di.ó.tʰen kɑj dis.kɛ́ːp.trọː ti.mɛ̂ːs o.kʰy.ron zdêw.gos ɑ.trẹː.dɑ̂n stó.lon ɑr.gẹ́ː.ɔːn]'

Word Tokenization

In [1]: from cltk.tokenize.word import WordTokenizer

In [2]: word_tokenizer = WordTokenizer('greek')

In [3]: text = 'Θουκυδίδης Ἀθηναῖος ξυνέγραψε τὸν πόλεμον τῶν Πελοποννησίων καὶ Ἀθηναίων,'

In [4]: word_tokenizer.tokenize(text)
Out[4]: ['Θουκυδίδης', 'Ἀθηναῖος', 'ξυνέγραψε', 'τὸν', 'πόλεμον', 'τῶν', 'Πελοποννησίων', 'καὶ', 'Ἀθηναίων', ',']



Word2Vec

The Word2Vec models have not been fully vetted and are offered in the spirit of a beta. The CLTK’s API for it will be revised.


You will need to install Gensim to use these features.

Word2Vec is a vector space model especially powerful for comparing words in relation to each other. For instance, it is commonly used to discover words which appear in similar contexts (something akin to synonyms; think of them as lexical clusters).

The CLTK repository contains pre-trained Word2Vec models for Greek (import as greek_word2vec_cltk), one lemmatized and the other not. They were trained on the TLG corpus. To train your own, see the README at the Greek Word2Vec repository.

One of the most common uses of Word2Vec is as a keyword expander. Keyword expansion is the taking of a query term, finding synonyms, and searching for those, too. Here’s an example of its use:

In [1]: from import search_corpus

In [2]: for x in search_corpus('πνεῦμα', 'tlg', context='sentence', case_insensitive=True, expand_keyword=True, threshold=0.5):
   ...:     print(x)
The following similar terms will be added to the 'πνεῦμα' query: '['γεννώμενον', 'ἔντερον', 'βάπτισμα', 'εὐαγγέλιον', 'δέρμα', 'ἐπιῤῥέον', 'ἔμβρυον', 'ϲῶμα', 'σῶμα', 'συγγενὲς']'.
('Lucius Annaeus Cornutus Phil.', "μυθολογεῖται δ' ὅτι διασπασθεὶς ὑπὸ τῶν Τιτά-\nνων συνετέθη πάλιν ὑπὸ τῆς Ῥέας, αἰνιττομένων τῶν \nπαραδόντων τὸν μῦθον ὅτι οἱ γεωργοί, θρέμματα γῆς \nὄντες, συνέχεαν τοὺς βότρυς καὶ τοῦ ἐν αὐτοῖς Διονύσου \nτὰ μέρη ἐχώρισαν ἀπ' ἀλλήλων, ἃ δὴ πάλιν ἡ εἰς ταὐτὸ \nσύρρυσις τοῦ γλεύκους συνήγαγε καὶ ἓν *σῶμα* ἐξ αὐτῶν \nἀπετέλεσε.")
('Metopus Phil.', '\nκαὶ ταὶ νόσοι δὲ γίνονται τῶ σώματος <τῷ> θερμότερον ἢ κρυμωδέσ-\nτερον γίνεσθαι τὸ *σῶμα*.')

threshold is the closeness of the query term to its neighboring words. Note that when expand_keyword=True, the search term will be stripped of any regular expression syntax.
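
The role of a similarity threshold can be illustrated with a few toy vectors. The sketch assumes that closeness is measured as cosine similarity (which is how Gensim scores neighbouring words) and that terms at or above the threshold are kept; the two-dimensional vectors and the function names are invented for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Toy two-dimensional "word vectors", assumed for illustration only.
toy_vectors = {
    "query": [1.0, 0.0],
    "near": [0.9, 0.1],
    "far": [0.0, 1.0],
}


def similar_terms(query, vectors, threshold):
    """Keep every other term whose similarity to the query meets the threshold."""
    target = vectors[query]
    return [term for term, vector in vectors.items()
            if term != query and cosine_similarity(target, vector) >= threshold]


expanded = similar_terms("query", toy_vectors, threshold=0.5)
print(expanded)
```

Raising the threshold keeps fewer, closer terms; lowering it widens the expansion at the cost of looser matches.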

The keyword expander leverages get_sims() (which in turn leverages functionality of the Gensim package) to find similar terms. Some examples of it in action:

In [3]: from cltk.vector.word2vec import get_sims

In [4]: get_sims('βασιλεύς', 'greek', lemmatized=False, threshold=0.5)
"word 'βασιλεύς' not in vocabulary"
The following terms in the Word2Vec model you may be looking for: '['βασκαίνων', 'βασκανίας', 'βασιλάκιος', 'βασιλίδων', 'βασανισθέντα', 'βασιλήϊον', 'βασιλευόμενα', 'βασανιστηρίων', … ]'.

In [36]: get_sims('τυραννος', 'greek', lemmatized=True, threshold=0.7)
"word 'τυραννος' not in vocabulary"
The following terms in the Word2Vec model you may be looking for: '['τυραννίσιν', 'τυρόριζαν', 'τυρεύοντες', 'τυρρηνοὶ', 'τυραννεύοντα', 'τυροὶ', 'τυραννικά', 'τυρσηνίαν', 'τυρώ', 'τυρσηνίας', … ]'.

To add and subtract vectors, you need to load the models yourself with Gensim.