Latin¶
Latin is a classical language belonging to the Italic branch of the Indo-European languages. The Latin alphabet is derived from the Etruscan and Greek alphabets, and ultimately from the Phoenician alphabet. Latin was originally spoken in Latium, in the Italian Peninsula. Through the power of the Roman Republic, it became the dominant language, initially in Italy and subsequently throughout the Roman Empire. Vulgar Latin developed into the Romance languages, such as Italian, Portuguese, Spanish, French, and Romanian. (Source: Wikipedia)
Note
For most of the following operations, you must first import the CLTK Latin linguistic data (named latin_models_cltk).
Note
For most of the following operations, the j/i and v/u replacer JVReplacer() and .lower() should be used on the input string first, if necessary.
Corpus Readers¶
Most users will want to access words, sentences, paragraphs, and even whole documents via a CorpusReader object. All corpus contributors should provide a suitable reader. There are corpus readers for the Perseus Latin collection in JSON format and for the Latin Library; others will be made available. The CorpusReader methods: paras() returns paragraphs, if possible; words() returns a generator of words; sents() returns a generator of sentences; docs() returns a generator of Python dictionary objects representing each document.
In [1]: from cltk.corpus.readers import get_corpus_reader
...: reader = get_corpus_reader(language='latin', corpus_name='latin_text_perseus')
...: # get all the docs
...: docs = list(reader.docs())
...: len(docs)
...:
Out[1]: 293
In [2]: # or set just one
...: reader._fileids = ['cicero__on-behalf-of-aulus-caecina__latin.json']
...:
In [3]: # get all the sentences
...: sentences = list(reader.sents())
...: len(sentences)
...:
Out[3]: 25435
In [4]: # or one at a time
...: sentences[0]
...:
Out[4]: '\n\t\t\t si , quantum in agro locisque desertis audacia potest, tantum in foro atque\n\t\t\t\tin iudiciis impudentia valeret, non minus nunc in causa cederet A. Caecina Sex.'
In [5]: # access an individual doc as a dictionary of dictionaries
...: doc = list(reader.docs())[0]
...: doc.keys()
...:
Out[5]: dict_keys(['meta', 'author', 'text', 'edition', 'englishTitle', 'source', 'originalTitle', 'original-urn', 'language', 'sourceLink', 'urn', 'filename'])
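The session above uses docs() and sents(); words() works the same way, returning a generator of word tokens. A sketch continuing the session (no output shown, since the tokens depend on which corpus files are selected):
In [6]: words = reader.words()
In [7]: [next(words) for _ in range(5)]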
Clausulae Analysis¶
Clausulae analysis is an integral part of Latin prosimetrics. The clausulae analysis module analyzes prose rhythm data generated by the prosody module to produce a dictionary of common rhythm types and their frequencies.
The list of rhythms which the module tallies is taken from Keeline, T. and Kirby, J., “Auceps syllabarum: A Digital Analysis of Latin Prose Rhythm,” Journal of Roman Studies, 2019.
In [1]: from cltk.prosody.latin.scanner import Scansion
In [2]: from cltk.prosody.latin.clausulae_analysis import Clausulae
In [3]: text = 'quō usque tandem abūtēre, Catilīna, patientiā nostrā. quam diū etiam furor iste tuus nōs ēlūdet.'
In [4]: s = Scansion()
In [5]: c = Clausulae()
In [6]: prosody = s.scan_text(text)
Out[6]: ['-uuu-uuu-u--x', 'uu-uu-uu----x']
In [7]: c.clausulae_analysis(prosody)
Out[7]: [{'cretic_trochee': 1}, {'cretic_trochee_resolved_a': 0}, {'cretic_trochee_resolved_b': 0}, {'cretic_trochee_resolved_c': 0}, {'double_cretic': 0}, {'molossus_cretic': 0}, {'double_molossus_cretic_resolved_a': 0}, {'double_molossus_cretic_resolved_b': 0}, {'double_molossus_cretic_resolved_c': 0}, {'double_molossus_cretic_resolved_d': 0}, {'double_molossus_cretic_resolved_e': 0}, {'double_molossus_cretic_resolved_f': 0}, {'double_molossus_cretic_resolved_g': 0}, {'double_molossus_cretic_resolved_h': 0}, {'double_trochee': 0}, {'double_trochee_resolved_a': 0}, {'double_trochee_resolved_b': 0}, {'hypodochmiac': 0}, {'hypodochmiac_resolved_a': 0}, {'hypodochmiac_resolved_b': 0}, {'spondaic': 1}, {'heroic': 0}]
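clausulae_analysis() returns a list of single-key dictionaries. Merging them into one frequency dictionary, and keeping only the rhythms that actually occur, takes two lines of plain Python (a convenience sketch, not part of the CLTK API; the output follows from the result above):
In [8]: freqs = {k: v for d in c.clausulae_analysis(prosody) for k, v in d.items()}
In [9]: {k: v for k, v in freqs.items() if v > 0}
Out[9]: {'cretic_trochee': 1, 'spondaic': 1}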
Converting J to I, V to U¶
In [1]: from cltk.stem.latin.j_v import JVReplacer
In [2]: j = JVReplacer()
In [3]: j.replace('vem jam')
Out[3]: 'uem iam'
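Since JVReplacer operates on the characters it is given, the note at the top of this page recommends lowercasing first; chaining .lower() with replace() handles mixed-case input:
In [4]: j.replace('Vem Jam'.lower())
Out[4]: 'uem iam'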
Converting PHI texts with TLGU¶
Note
- Update this section with new post-TLGU processors in formatter.py
The TLGU is C-language software which does an excellent job at converting the TLG and PHI corpora into various forms of human-readable Unicode plaintext. The CLTK has an automated downloader and installer, as well as a wrapper which facilitates its use. Download and installation are handled in the background. When TLGU() is instantiated, it checks the local OS for a functioning version of the software and installs it if not found.
Most users will want to do a bulk conversion of the entirety of a corpus without any text markup (such as chapter or line numbers).
In [1]: from cltk.corpus.greek.tlgu import TLGU
In [2]: t = TLGU()
In [3]: t.convert_corpus(corpus='phi5') # ~/cltk_data/latin/text/phi5/plaintext/
You can also divide the texts into a file for each individual work.
In [4]: t.divide_works('phi5') # ~/cltk_data/latin/text/phi5/individual_works/
Once these files are created, see PHI Indices below for accessing these newly created files.
See also Text Cleanup for removing extraneous non-textual characters from these files.
Information Retrieval¶
See Multilingual Information Retrieval for Latin-specific search options.
Declining¶
The CollatinusDecliner() attempts to retrieve all possible forms of a lemma. This may be useful if you want to search for all forms of a word across a repository of non-lemmatized texts. This class is based on lexical and linguistic data built by the Collatinus Team. Data corrections and additions can be contributed back to the Collatinus project (in particular, into bin/data).
Example use, assuming you have already imported latin_models_cltk:
In [1]: from cltk.stem.latin.declension import CollatinusDecliner
In [2]: decliner = CollatinusDecliner()
In [3]: print(decliner.decline("via"))
Out[3]:
[('via', '--s----n-'),
('via', '--s----v-'),
('viam', '--s----a-'),
('viae', '--s----g-'),
('viae', '--s----d-'),
('via', '--s----b-'),
('viae', '--p----n-'),
('viae', '--p----v-'),
('vias', '--p----a-'),
('viarum', '--p----g-'),
('viis', '--p----d-'),
('viis', '--p----b-')]
In [4]: decliner.decline("via", flatten=True)
Out[4]:
['via',
'via',
'viam',
'viae',
'viae',
'via',
'viae',
'viae',
'vias',
'viarum',
'viis',
'viis']
Lemmatization¶
*This lemmatizer is deprecated. It is recommended that you use the Backoff Lemmatizer described below.*
Tip
For ambiguous forms, which could belong to several headwords, the current lemmatizer chooses the more commonly occurring headword (code here). For any errors that you spot, please open a ticket.
The CLTK’s lemmatizer is based on a key-value store, whose code is available at the CLTK’s Latin lemma/POS repository.
The lemmatizer offers several input and output options. For text input, it can take a string or a list of tokens (which, by the way, need j and v replaced first). Here is an example of the lemmatizer taking a string:
In [1]: from cltk.stem.lemma import LemmaReplacer
In [2]: from cltk.stem.latin.j_v import JVReplacer
In [3]: sentence = 'Aeneadum genetrix, hominum divomque voluptas, alma Venus, caeli subter labentia signa quae mare navigerum, quae terras frugiferentis concelebras, per te quoniam genus omne animantum concipitur visitque exortum lumina solis.'
In [6]: sentence = sentence.lower()
In [7]: lemmatizer = LemmaReplacer('latin')
In [8]: lemmatizer.lemmatize(sentence)
Out[8]:
['aeneadum',
'genetrix',
',',
'homo',
'divus',
'voluptas',
',',
'almus',
...]
And here taking a list:
In [9]: lemmatizer.lemmatize(['quae', 'terras', 'frugiferentis', 'concelebras'])
Out[9]: ['qui1', 'terra', 'frugiferens', 'concelebro']
The lemmatizer takes several optional arguments for controlling output: return_raw=True and return_string=True. return_raw returns the original inflection along with its headword:
In [10]: lemmatizer.lemmatize(['quae', 'terras', 'frugiferentis', 'concelebras'], return_raw=True)
Out[10]:
['quae/qui1',
'terras/terra',
'frugiferentis/frugiferens',
'concelebras/concelebro']
And return_string=True wraps the list in ' '.join():
In [11]: lemmatizer.lemmatize(['quae', 'terras', 'frugiferentis', 'concelebras'], return_string=True)
Out[11]: 'qui1 terra frugiferens concelebro'
These two arguments can be combined, as well.
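For instance, passing both flags should return the raw inflection/headword pairs joined into one string; the value below simply combines the two outputs already shown (treat it as a sketch, not a verified run):
In [12]: lemmatizer.lemmatize(['quae', 'terras', 'frugiferentis', 'concelebras'], return_raw=True, return_string=True)
Out[12]: 'quae/qui1 terras/terra frugiferentis/frugiferens concelebras/concelebro'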
Lemmatization, backoff method¶
The CLTK offers a series of lemmatizers that can be combined in a backoff chain, i.e. if one lemmatizer is unable to return a headword for a token, this token can be passed onto another lemmatizer until either a headword is returned or the sequence ends.
There is a generic version of the backoff Latin lemmatizer which requires the CLTK Latin models data (latin_models_cltk). The lemmatizer expects this model to be stored in a folder called cltk_data in the user’s home directory.
To use the generic version of the backoff Latin Lemmatizer:
In [1]: from cltk.lemmatize.latin.backoff import BackoffLatinLemmatizer
In [2]: lemmatizer = BackoffLatinLemmatizer()
In [3]: tokens = ['Quo', 'usque', 'tandem', 'abutere', ',', 'Catilina', ',', 'patientia', 'nostra', '?']
In [4]: lemmatizer.lemmatize(tokens)
Out[4]: [('Quo', 'Quo'), ('usque', 'usque'), ('tandem', 'tandem'), ('abutere', 'abutor'), (',', 'punc'), ('Catilina', 'Catilina'), (',', 'punc'), ('patientia', 'patientia'), ('nostra', 'noster'), ('?', 'punc')]
NB: The backoff chain for this lemmatizer is defined as follows: 1. a dictionary-based lemmatizer with high-frequency, unambiguous forms; 2. a training-data-based lemmatizer built on 4,000 sentences from the [Perseus Latin Dependency Treebanks](https://perseusdl.github.io/treebank_data/); 3. a regular-expression-based lemmatizer transforming unambiguous endings; 4. a dictionary-based lemmatizer with the complete set of Morpheus lemmas; 5. an ‘identity’ lemmatizer returning the token as the lemma. Each of these sub-lemmatizers is explained in the documents for “Multilingual”; the backoff idea itself is sketched below.
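The backoff pattern itself is easy to picture. A minimal pure-Python sketch of the idea (an illustration only, not the CLTK implementation):
def lemmatize_with_backoff(token, sub_lemmatizers):
    """Try each sub-lemmatizer in order; stop at the first headword found."""
    for lemmatize in sub_lemmatizers:
        lemma = lemmatize(token)
        if lemma is not None:  # None signals "no headword found; back off"
            return lemma
    return token  # final 'identity' step: the token is its own lemma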
Line Tokenization¶
The line tokenizer takes a string input into tokenize() and returns a list of strings.
In [1]: from cltk.tokenize.line import LineTokenizer
In [2]: tokenizer = LineTokenizer('latin')
In [3]: untokenized_text = """49. Miraris verbis nudis me scribere versus?\nHoc brevitas fecit, sensus coniungere binos."""
In [4]: tokenizer.tokenize(untokenized_text)
Out[4]: ['49. Miraris verbis nudis me scribere versus?','Hoc brevitas fecit, sensus coniungere binos.']
The line tokenizer by default removes multiple line breaks. If you wish to retain blank lines in the returned list, set include_blanks to True.
In [5]: untokenized_text = """48. Cum tibi contigerit studio cognoscere multa,\nFac discas multa, vita nil discere velle.\n\n49. Miraris verbis nudis me scribere versus?\nHoc brevitas fecit, sensus coniungere binos."""
In [6]: tokenizer.tokenize(untokenized_text, include_blanks=True)
Out[6]: ['48. Cum tibi contigerit studio cognoscere multa,','Fac discas multa, vita nil discere velle.','','49. Miraris verbis nudis me scribere versus?','Hoc brevitas fecit, sensus coniungere binos.']
Macronizer¶
Automatically mark long Latin vowels with a macron. The algorithm used in this module is largely based on Johan Winge’s, which is detailed in his thesis.
Note that the macronizer’s accuracy varies depending on which tagger is used. Currently, the macronizer supports the following taggers: tag_ngram_123_backoff, tag_tnt, and tag_crf. The tagger is selected when calling the class, as seen on line 2 below. Be sure to first import the data models from latin_models_cltk, via the corpus importer, since both the taggers and the macronizer rely on them.
The macronizer can either macronize text, as seen at line 4 below, or return a list of tagged tokens containing the macronized forms, as on line 5.
In [1]: from cltk.prosody.latin.macronizer import Macronizer
In [2]: macronizer = Macronizer('tag_ngram_123_backoff')
In [3]: text = 'Quo usque tandem, O Catilina, abutere nostra patientia?'
In [4]: macronizer.macronize_text(text)
Out[4]: 'quō usque tandem , ō catilīnā , abūtēre nostrā patientia ?'
In [5]: macronizer.macronize_tags(text)
Out[5]: [('quo', 'd--------', 'quō'), ('usque', 'd--------', 'usque'), ('tandem', 'd--------', 'tandem'), (',', 'u--------', ','), ('o', 'e--------', 'ō'), ('catilina', 'n-s---mb-', 'catilīnā'), (',', 'u--------', ','), ('abutere', 'v2sfip---', 'abūtēre'), ('nostra', 'a-s---fb-', 'nostrā'), ('patientia', 'n-s---fn-', 'patientia'), ('?', None, '?')]
Making POS training sets¶
Warning
POS tagging is a work in progress. A new tagging dictionary has been created, though a tagger has not yet been written.
First, obtain the Latin POS tagging files. The important file here is cltk_latin_pos_dict.txt, which is saved at ~/cltk_data/compiled/pos_latin. This file is a Python dict type which aims to give all possible parts of speech for any given form, though it is based on the incomplete Perseus latin-analyses.txt. Thus, there may be gaps in (i) the inflected forms defined and (ii) the comprehensiveness of the analyses of any given form. cltk_latin_pos_dict.txt looks like:
{'-nam': {'perseus_pos': [{'pos0': {'case': 'indeclform',
'gloss': '',
'type': 'conj'}}]},
'-namque': {'perseus_pos': [{'pos0': {'case': 'indeclform',
'gloss': '',
'type': 'conj'}}]},
'-sed': {'perseus_pos': [{'pos0': {'case': 'indeclform',
'gloss': '',
'type': 'conj'}}]},
'Aaron': {'perseus_pos': [{'pos0': {'case': 'nom',
'gender': 'masc',
'gloss': 'Aaron',
'number': 'sg',
'type': 'substantive'}}]},
}
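Because the file is a Python dict literal like the excerpt above, it can be loaded with ast.literal_eval (a loading sketch; the path assumes the default location mentioned earlier, and the lookup value follows from the 'Aaron' entry shown):
In [1]: import ast, os
In [2]: path = os.path.expanduser('~/cltk_data/compiled/pos_latin/cltk_latin_pos_dict.txt')
In [3]: with open(path) as f:
   ...:     pos_dict = ast.literal_eval(f.read())
   ...:
In [4]: pos_dict['Aaron']['perseus_pos'][0]['pos0']['type']
Out[4]: 'substantive'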
If you wish to edit the POS dictionary creator, see cltk_latin_pos_dict.txt. For more, see the [pos_latin](https://github.com/cltk/latin_pos_lemmata_cltk) repository.
Named Entity Recognition¶
A simple interface to a list of Latin proper nouns is available (see the repo for how the list was created). By default, tag_ner() takes a string input and returns a list of tuples. However, it can also take pre-tokenized forms and return a string.
In [1]: from cltk.tag import ner
In [2]: from cltk.stem.latin.j_v import JVReplacer
In [3]: text_str = """ut Venus, ut Sirius, ut Spica, ut aliae quae primae dicuntur esse mangitudinis."""
In [4]: jv_replacer = JVReplacer()
In [5]: text_str_iu = jv_replacer.replace(text_str)
In [7]: ner.tag_ner('latin', input_text=text_str_iu, output_type=list)
Out[7]:
[('ut',),
('Uenus', 'Entity'),
(',',),
('ut',),
('Sirius', 'Entity'),
(',',),
('ut',),
('Spica', 'Entity'),
(',',),
('ut',),
('aliae',),
('quae',),
('primae',),
('dicuntur',),
('esse',),
('mangitudinis',),
('.',)]
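As noted above, tag_ner() can also return a string, with entities marked inline (e.g. Uenus/Entity); pass output_type=str. The output is omitted here, since the exact formatting depends on the installed name list:
In [8]: ner.tag_ner('latin', input_text=text_str_iu, output_type=str)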
PHI Indices¶
Located at cltk/corpus/latin/phi5_index.py of the source are indices for the PHI5 corpus, one of just id and name (PHI5_INDEX) and another also containing information on the authors’ works (PHI5_WORKS_INDEX).
In [1]: from cltk.corpus.latin.phi5_index import PHI5_INDEX
In [2]: PHI5_INDEX
Out[2]:
{'LAT1050': 'Lucius Verginius Rufus',
'LAT2335': 'Anonymi de Differentiis [Fronto]',
'LAT1345': 'Silius Italicus',
... }
In [3]: from cltk.corpus.latin.phi5_index import PHI5_WORKS_INDEX
In [4]: PHI5_WORKS_INDEX
Out[4]:
{'LAT2335': {'works': ['001'], 'name': 'Anonymi de Differentiis [Fronto]'},
'LAT1345': {'works': ['001'], 'name': 'Silius Italicus'},
'LAT1351': {'works': ['001', '002', '003', '004', '005'],
'name': 'Cornelius Tacitus'},
'LAT2349': {'works': ['001', '002', '003', '004', '005', '006', '007'],
'name': 'Maurus Servius Honoratus, Servius'},
...}
In addition to these indices there are several helper functions which will build filepaths for your particular computer. Note that you will need to have run convert_corpus(corpus='phi5') and divide_works('phi5') from the TLGU() class, respectively, for the following two functions.
In [1]: from cltk.corpus.utils.formatter import assemble_phi5_author_filepaths
In [2]: assemble_phi5_author_filepaths()
Out[2]:
['/Users/kyle/cltk_data/latin/text/phi5/plaintext/LAT0636.TXT',
'/Users/kyle/cltk_data/latin/text/phi5/plaintext/LAT0658.TXT',
'/Users/kyle/cltk_data/latin/text/phi5/plaintext/LAT0827.TXT',
...]
In [3]: from cltk.corpus.utils.formatter import assemble_phi5_works_filepaths
In [4]: assemble_phi5_works_filepaths()
Out[4]:
['/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0636.TXT-001.txt',
'/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0902.TXT-001.txt',
'/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0472.TXT-001.txt',
'/Users/kyle/cltk_data/latin/text/phi5/individual_works/LAT0472.TXT-002.txt',
...]
These two functions are useful when, for example, needing to process all authors of the PHI5 corpus, all works of the corpus, or all works of one particular author.
POS tagging¶
These taggers were built with the assistance of the NLTK. The backoff tagger is Bayesian and the TnT tagger is an HMM. To obtain the models, first import the latin_models_cltk corpus.
1-2-3-gram backoff tagger¶
In [1]: from cltk.tag.pos import POSTag
In [2]: tagger = POSTag('latin')
In [3]: tagger.tag_ngram_123_backoff('Gallia est omnis divisa in partes tres')
Out[3]:
[('Gallia', None),
('est', 'V3SPIA---'),
('omnis', 'A-S---MN-'),
('divisa', 'T-PRPPNN-'),
('in', 'R--------'),
('partes', 'N-P---FA-'),
('tres', 'M--------')]
TnT tagger¶
In [4]: tagger.tag_tnt('Gallia est omnis divisa in partes tres')
Out[4]:
[('Gallia', 'Unk'),
('est', 'V3SPIA---'),
('omnis', 'N-S---MN-'),
('divisa', 'T-SRPPFN-'),
('in', 'R--------'),
('partes', 'N-P---FA-'),
('tres', 'M--------')]
CRF tagger¶
Warning
This tagger’s accuracy has not yet been evaluated.
We use the NLTK’s CRF tagger. For information on it, see the NLTK docs.
In [5]: tagger.tag_crf('Gallia est omnis divisa in partes tres')
Out[5]:
[('Gallia', 'A-P---NA-'),
('est', 'V3SPIA---'),
('omnis', 'A-S---FN-'),
('divisa', 'N-S---FN-'),
('in', 'R--------'),
('partes', 'N-P---FA-'),
('tres', 'M--------')]
Lapos tagger¶
Note
The Lapos tagger is available in its own repo, with the master branch for Linux and the apple branch for Mac. See directions there on how to use it.
Prosody Scanning¶
A prosody scanner is available for text which already has had its natural lengths marked with macrons. It returns a list of strings of long and short marks for each sentence, with an anceps marking the last syllable of each sentence.
The algorithm is designed only for Latin prose rhythms. It is detailed in Keeline, T. and Kirby, J “Auceps syllabarum: A Digital Analysis of Latin Prose Rhythm,” Journal of Roman Studies, 2019.
In [1]: from cltk.prosody.latin.scanner import Scansion
In [2]: scanner = Scansion()
In [3]: text = 'quō usque tandem abūtēre, Catilīna, patientiā nostrā. quam diū etiam furor iste tuus nōs ēlūdet.'
In [4]: scanner.scan_text(text)
Out[4]: ['-uuu-uuu-u--x', 'uu-uu-uu----x']
Scansion of Poetry¶
About the use of macrons in poetry¶
Most Latin poetry has come down to us without macrons. Some lines of Latin poetry can be scanned and fit a poetic meter without any macrons at all, owing to the rules of meter and positional accentuation.
Automatically macronizing every word in a line of Latin poetry does not mean that it will automatically scan correctly. Poets often diverge from standard usage: regularly long vowels can appear short (the verb nesciō, for example, scans its final personal ending as a short o in poetry), and regularly short vowels can appear long (Lucretius regularly writes rēligiō, which scans, instead of the usual religiō). There is also the prosodic device of diastole, in which the short final vowel of a word is lengthened to fit the meter, e.g. tibī in Lucretius I.104 and III.899.
However, some macrons are necessary for scansion: Lucretius I.12 begins with “aeriae” which will not scan in hexameter unless one substitutes its macronized form “āeriae”.
HexameterScanner¶
The HexameterScanner class scans lines of Latin hexameter (with or without macrons) and determines if the line is a valid hexameter and what its scansion pattern is.
If the line is not properly macronized to scan, the scanner tries to determine whether the line:
- Scans merely by position.
- Syllabifies according to the common rules.
- Is complete (e.g. some hexameter lines are partial).
The scanner also determines which syllables would have to be made long to make the line scan as a valid hexameter. The scanner records scansion_notes about which transformations had to be made to the line of verse to get it to scan. The HexameterScanner’s scan method returns a Verse class object.
In [1]: from cltk.prosody.latin.hexameter_scanner import HexameterScanner
In [2]: scanner = HexameterScanner()
In [3]: scanner.scan("impulerit. Tantaene animis caelestibus irae?")
Out[3]: Verse(original='impulerit. Tantaene animis caelestibus irae?', scansion='- U U - - - U U - - - U U - - ', meter='hexameter', valid=True, syllable_count=15, accented='īmpulerīt. Tāntaene animīs caelēstibus īrae?', scansion_notes=['Valid by positional stresses.'], syllables = ['īm', 'pu', 'le', 'rīt', 'Tān', 'taen', 'a', 'ni', 'mīs', 'cae', 'lēs', 'ti', 'bus', 'i', 'rae'])
PentameterScanner¶
The PentameterScanner class scans lines of Latin pentameter (with or without macrons) and determines if the line is a valid pentameter and what its scansion pattern is.
If the line is not properly macronized to scan, the scanner tries to determine whether the line:
- Scans merely by position.
- Syllabifies according to the common rules.
The scanner also determines which syllables would have to be made long to make the line scan as a valid pentameter. The scanner records scansion_notes about which transformations had to be made to the line of verse to get it to scan. The PentameterScanner’s scan method returns a Verse class object.
In [1]: from cltk.prosody.latin.pentameter_scanner import PentameterScanner
In [2]: scanner = PentameterScanner()
In [3]: scanner.scan("ex hoc ingrato gaudia amore tibi.")
Out[3]: Verse(original='ex hoc ingrato gaudia amore tibi.', scansion='- - - - - - U U - U U U ', meter='pentameter', valid=True, syllable_count=12, accented='ēx hōc īngrātō gaudia amōre tibi.', scansion_notes=['Spondaic pentameter'], syllables = ['ēx', 'hoc', 'īn', 'gra', 'to', 'gau', 'di', 'a', 'mo', 're', 'ti', 'bi'])
HendecasyllableScanner¶
The HendecasyllableScanner class scans lines of Latin hendecasyllables (with or without macrons) and determines if the line is a valid example of the hendecasyllabic meter and what its scansion pattern is.
If the line is not properly macronized to scan, the scanner tries to determine whether the line:
- Scans merely by position.
- Syllabifies according to the common rules.
The scanner also determines which syllables would have to be made long to make the line scan as a valid hendecasyllable. The scanner records scansion_notes about which transformations had to be made to the line of verse to get it to scan. The HendecasyllableScanner’s scan method returns a Verse class object.
In [1]: from cltk.prosody.latin.hendecasyllable_scanner import HendecasyllableScanner
In [2]: scanner = HendecasyllableScanner()
In [3]: scanner.scan("Iam tum, cum ausus es unus Italorum")
Out[3]: Verse(original='Iam tum, cum ausus es unus Italorum', scansion=' - - - U U - U - U - U ', meter='hendecasyllable', valid=True, syllable_count=11, accented='Iām tūm, cum ausus es ūnus Ītalōrum', scansion_notes=['antepenult foot onward normalized.'], syllables = ['Jām', 'tūm', 'c', 'au', 'sus', 'es', 'u', 'nus', 'I', 'ta', 'lo', 'rum'])
Verse¶
The Verse class object returned by the HexameterScanner, PentameterScanner, and HendecasyllableScanner provides slots for:
- original - original line of verse
- scansion - the scansion pattern
- meter - the meter of the verse
- valid - whether or not the line is a valid instance of its meter
- syllable_count - number of syllables according to common syllabification rules
- accented - if the line is valid, a version of it with accented vowels (diphthongs are not accented)
- scansion_notes - a list recording the characteristics of the transformations made to the original line
- syllables - a list of the syllables into which the line is divided at the scansion level; elided syllables are not included
The Scansion notes are defined in a NOTE_MAP dictionary object contained in the ScansionConstants class.
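Since these fields are slots on the returned Verse object, they can be read directly as attributes. A short sketch reusing the HexameterScanner example above (the values shown follow from its Out[3]):
In [1]: from cltk.prosody.latin.hexameter_scanner import HexameterScanner
In [2]: verse = HexameterScanner().scan("impulerit. Tantaene animis caelestibus irae?")
In [3]: (verse.valid, verse.meter, verse.syllable_count)
Out[3]: (True, 'hexameter', 15)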
ScansionConstants¶
The ScansionConstants class is a configuration class for specifying scansion constants. It also allows users to customize scansion constants and scanner behavior; for example, a user may alter the symbols used for stressed and unstressed syllables:
In [1]: from cltk.prosody.latin.scansion_constants import ScansionConstants
In [2]: constants = ScansionConstants(unstressed="U", stressed="-", optional_terminal_ending="X")
In [3]: constants.DACTYL
Out[3]: '-UU'
In [4]: smaller_constants = ScansionConstants(unstressed="˘", stressed="¯", optional_terminal_ending="x")
In [5]: smaller_constants.DACTYL
Out[5]: '¯˘˘'
Constants containing strings have characters in both upper and lower case, since they will often be used in regular expressions and to preserve a verse’s original case.
Syllabifier¶
The Syllabifier class is a Latin language syllabifier. It parses a Latin word or a space separated list of words into a list of syllables. Consonantal I is transformed into a J at the start of a word as necessary. Tuned for poetry and verse, this class is tolerant of isolated single character consonants that may appear due to elision.
In [1]: from cltk.prosody.latin.syllabifier import Syllabifier
In [1]: syllabifier = Syllabifier()
In [2]: syllabifier.syllabify("libri")
Out[2]: ['li', 'bri']
In [3]: syllabifier.syllabify("contra")
Out[3]: ['con', 'tra']
Metrical Validator¶
The MetricalValidator class is a utility class for validating scansion patterns. Users may configure the scansion symbols by passing a customized ScansionConstants object as a constructor argument:
In [1]: from cltk.prosody.latin.metrical_validator import MetricalValidator
In [2]: MetricalValidator().is_valid_hexameter("-UU---UU---UU-U")
Out[2]: True
ScansionFormatter¶
The ScansionFormatter class is a utility class for formatting scansion patterns.
In [1]: from cltk.prosody.latin.scansion_formatter import ScansionFormatter
In [2]: ScansionFormatter().hexameter("-UU-UU-UU---UU--")
Out[2]: '-UU|-UU|-UU|--|-UU|--'
In [3]: from cltk.prosody.latin.scansion_constants import ScansionConstants
In [4]: constants = ScansionConstants(unstressed="˘", stressed="¯", optional_terminal_ending="x")
In [5]: formatter = ScansionFormatter(constants)
In [6]: formatter.hexameter("¯˘˘¯˘˘¯˘˘¯¯¯˘˘¯¯")
Out[6]: '¯˘˘|¯˘˘|¯˘˘|¯¯|¯˘˘|¯¯'
string_utils module¶
The string_utils module contains utility methods for processing scansion and text. For example, punctuation_for_spaces_dict() returns a dictionary object that maps Unicode punctuation to blank spaces; these are essential for scansion, keeping stress patterns in alignment with the original vowel positions in the verse.
In [1]: import cltk.prosody.latin.string_utils as string_utils
In [2]: "I'm ok! Oh #%&*()[]{}!? Fine!".translate(string_utils.punctuation_for_spaces_dict()).strip()
Out[2]: 'I m ok Oh Fine'
Semantics¶
The Semantics module allows for the lookup of Latin lemmata, synonyms, and translations into Greek. Lemma, synonym, and translation dictionaries are drawn from the open-source Tesserae Project (http://github.com/tesserae/tesserae).
The dictionaries used by this module are stored in https://github.com/cltk/latin_models_cltk/tree/master/semantics and https://github.com/cltk/greek_models_cltk/tree/master/semantics for Latin and Greek, respectively. In order to use the Semantics module, it is necessary to import those repos first.
Tip
When lemmatizing ambiguous forms, the Semantics module is designed to return all possibilities. A probability distribution is included with the list of results, but as of June 8, 2018 the total probability is evenly distributed over all possibilities. Future updates will include a more intelligent system for determining the most likely lemma, synonym, or translation.
The Lemmata class includes two relevant methods: lookup() takes a list of tokens standardized for spelling and returns a complex object which includes a probability distribution; isolate() takes the object returned by lookup() and discards everything but the lemmata.
In [1]: from cltk.semantics.latin.lookup import Lemmata
In [2]: lemmatizer = Lemmata(dictionary='lemmata', language='latin')
In [3]: tokens = ['ceterum', 'antequam', 'destinata', 'componam']
In [4]: lemmas = lemmatizer.lookup(tokens)
Out[4]:
[('ceterum', [('ceterus', 1.0)]), ('antequam', [('antequam', 1.0)]), ('destinata', [('destinatus', 0.25), ('destinatum', 0.25), ('destinata', 0.25), ('destino', 0.25)]), ('componam', [('compono', 1.0)])]
In [5]: just_lemmas = lemmatizer.isolate(lemmas)
Out[5]: ['ceterus', 'antequam', 'destinatus', 'destinatum', 'destinata', 'destino', 'compono']
The Synonym class can be initialized to look up either synonyms or translations. It expects a list of lemmata, not inflected forms. Only successful ‘lookups’ will return results.
In [1]: from cltk.semantics.latin.lookup import Synonyms
In [2]: translator = Synonyms(dictionary='translations', language='latin')
In [3]: lemmas = ['ceterus', 'antequam', 'destinatus', 'destinatum', 'destinata', 'destino', 'compono']
In [4]: translations = translator.lookup(lemmas)
Out[4]: [('destino', [('σκοπός', 1.0)]), ('compono', [('συντίθημι', 1.0)])]
A raw list of translations can likewise be obtained from the translation object with isolate():
In [5]: just_translations = translator.isolate(translations)
Out[5]: ['σκοπός', 'συντίθημι']
Sentence Tokenization¶
Sentence tokenization for Latin is available using a [Punkt](https://www.nltk.org/_modules/nltk/tokenize/punkt.html) tokenizer trained on the Latin Library. The model for this tokenizer can be found in the CLTK corpora under latin_models_cltk/tokenizers/sentence/latin_punkt. The training process considers Latin punctuation patterns as well as common abbreviations (e.g. nomina). To tokenize a Latin text by sentences:
In [1]: from cltk.tokenize.latin.sentence import SentenceTokenizer
In [2]: sent_tokenizer = SentenceTokenizer()
In [3]: untokenized_text = 'Meministine me ante diem XII Kalendas Novembris dicere in senatu fore in armis certo die, qui dies futurus esset ante diem VI Kal. Novembris, C. Manlium, audaciae satellitem atque administrum tuae? Num me fefellit, Catilina, non modo res tanta, tam atrox tamque incredibilis, verum, id quod multo magis est admirandum, dies? Dixi ego idem in senatu caedem te optumatium contulisse in ante diem V Kalendas Novembris, tum cum multi principes civitatis Roma non tam sui conservandi quam tuorum consiliorum reprimendorum causa profugerunt.'
In [4]: sent_tokenizer.tokenize(untokenized_text)
Out[4]: ['Meministine me ante diem XII Kalendas Novembris dicere in senatu fore in armis certo die, qui dies futurus esset ante diem VI Kal. Novembris, C. Manlium, audaciae satellitem atque administrum tuae?', 'Num me fefellit, Catilina, non modo res tanta, tam atrox tamque incredibilis, verum, id quod multo magis est admirandum, dies?', 'Dixi ego idem in senatu caedem te optumatium contulisse in ante diem V Kalendas Novembris, tum cum multi principes civitatis Roma non tam sui conservandi quam tuorum consiliorum reprimendorum causa profugerunt.']
Note that the Latin sentence tokenizer takes account of abbreviations like ‘Kal.’ and ‘C.’ and does not split sentences at these points.
By default, the Latin Punkt Sentence Tokenizer splits on period, question mark, and exclamation point. There is a strict parameter that adds colon, semicolon, and hyphen to this.
In [5]: sent_tokenizer = SentenceTokenizer(strict=True)
In [6]: untokenized_text = 'In principio creavit Deus caelum et terram; terra autem erat inanis et vacua et tenebrae super faciem abyssi et spiritus Dei ferebatur super aquas; dixitque Deus fiat lux et facta est lux; et vidit Deus lucem quod esset bona et divisit lucem ac tenebras.'
In [7]: sent_tokenizer.tokenize(untokenized_text)
Out[7]: ['In principio creavit Deus caelum et terram;', 'terra autem erat inanis et vacua et tenebrae super faciem abyssi et spiritus Dei ferebatur super aquas;', 'dixitque Deus fiat lux et facta est lux;', 'et vidit Deus lucem quod esset bona et divisit lucem ac tenebras.']
NB: The old method for sentence tokenizer, i.e. TokenizeSentence, is still available, but now calls the tokenizer described above.
In [5]: from cltk.tokenize.sentence import TokenizeSentence
In [6]: tokenizer = TokenizeSentence('latin')
etc.
Stemming¶
The stemmer strips suffixes via an algorithm. It is much faster than the lemmatizer, which uses a replacement list.
In [1]: from cltk.stem.latin.stem import Stemmer
In [2]: sentence = 'Est interdum praestare mercaturis rem quaerere, nisi tam periculosum sit, et item foenerari, si tam honestum. Maiores nostri sic habuerunt et ita in legibus posiuerunt: furem dupli condemnari, foeneratorem quadrupli. Quanto peiorem ciuem existimarint foeneratorem quam furem, hinc licet existimare. Et uirum bonum quom laudabant, ita laudabant: bonum agricolam bonumque colonum; amplissime laudari existimabatur qui ita laudabatur. Mercatorem autem strenuum studiosumque rei quaerendae existimo, uerum, ut supra dixi, periculosum et calamitosum. At ex agricolis et uiri fortissimi et milites strenuissimi gignuntur, maximeque pius quaestus stabilissimusque consequitur minimeque inuidiosus, minimeque male cogitantes sunt qui in eo studio occupati sunt. Nunc, ut ad rem redeam, quod promisi institutum principium hoc erit.'
In [3]: stemmer = Stemmer()
In [4]: stemmer.stem(sentence.lower())
Out[4]: 'est interd praestar mercatur r quaerere, nisi tam periculos sit, et it foenerari, si tam honestum. maior nostr sic habueru et ita in leg posiuerunt: fur dupl condemnari, foenerator quadrupli. quant peior ciu existimari foenerator quam furem, hinc lice existimare. et uir bon quo laudabant, ita laudabant: bon agricol bon colonum; amplissim laudar existimaba qui ita laudabatur. mercator autem strenu studios re quaerend existimo, uerum, ut supr dixi, periculos et calamitosum. at ex agricol et uir fortissim et milit strenuissim gignuntur, maxim p quaest stabilissim consequi minim inuidiosus, minim mal cogitant su qui in e studi occupat sunt. nunc, ut ad r redeam, quod promis institut principi hoc erit. '
Stoplist Construction¶
To extract a stoplist from a collection of documents:
In [1]: test_1 = """cogitanti mihi saepe numero et memoria vetera repetenti perbeati fuisse, quinte frater, illi videri solent, qui in optima re publica, cum et honoribus et rerum gestarum gloria florerent, eum vitae cursum tenere potuerunt, ut vel in negotio sine periculo vel in otio cum dignitate esse possent; ac fuit cum mihi quoque initium requiescendi atque animum ad utriusque nostrum praeclara studia referendi fore iustum et prope ab omnibus concessum arbitrarer, si infinitus forensium rerum labor et ambitionis occupatio decursu honorum, etiam aetatis flexu constitisset. quam spem cogitationum et consiliorum meorum cum graves communium temporum tum varii nostri casus fefellerunt; nam qui locus quietis et tranquillitatis plenissimus fore videbatur, in eo maximae moles molestiarum et turbulentissimae tempestates exstiterunt; neque vero nobis cupientibus atque exoptantibus fructus oti datus est ad eas artis, quibus a pueris dediti fuimus, celebrandas inter nosque recolendas. nam prima aetate incidimus in ipsam perturbationem disciplinae veteris, et consulatu devenimus in medium rerum omnium certamen atque discrimen, et hoc tempus omne post consulatum obiecimus eis fluctibus, qui per nos a communi peste depulsi in nosmet ipsos redundarent. sed tamen in his vel asperitatibus rerum vel angustiis temporis obsequar studiis nostris et quantum mihi vel fraus inimicorum vel causae amicorum vel res publica tribuet oti, ad scribendum potissimum conferam; tibi vero, frater, neque hortanti deero neque roganti, nam neque auctoritate quisquam apud me plus valere te potest neque voluntate."""
In [2]: test_2 = """ac mihi repetenda est veteris cuiusdam memoriae non sane satis explicata recordatio, sed, ut arbitror, apta ad id, quod requiris, ut cognoscas quae viri omnium eloquentissimi clarissimique senserint de omni ratione dicendi. vis enim, ut mihi saepe dixisti, quoniam, quae pueris aut adulescentulis nobis ex commentariolis nostris incohata ac rudia exciderunt, vix sunt hac aetate digna et hoc usu, quem ex causis, quas diximus, tot tantisque consecuti sumus, aliquid eisdem de rebus politius a nobis perfectiusque proferri; solesque non numquam hac de re a me in disputationibus nostris dissentire, quod ego eruditissimorum hominum artibus eloquentiam contineri statuam, tu autem illam ab elegantia doctrinae segregandam putes et in quodam ingeni atque exercitationis genere ponendam. ac mihi quidem saepe numero in summos homines ac summis ingeniis praeditos intuenti quaerendum esse visum est quid esset cur plures in omnibus rebus quam in dicendo admirabiles exstitissent; nam quocumque te animo et cogitatione converteris, permultos excellentis in quoque genere videbis non mediocrium artium, sed prope maximarum. quis enim est qui, si clarorum hominum scientiam rerum gestarum vel utilitate vel magnitudine metiri velit, non anteponat oratori imperatorem? quis autem dubitet quin belli duces ex hac una civitate praestantissimos paene innumerabilis, in dicendo autem excellentis vix paucos proferre possimus? iam vero consilio ac sapientia qui regere ac gubernare rem publicam possint, multi nostra, plures patrum memoria atque etiam maiorum exstiterunt, cum boni perdiu nulli, vix autem singulis aetatibus singuli tolerabiles oratores invenirentur. ac ne qui forte cum aliis studiis, quae reconditis in artibus atque in quadam varietate litterarum versentur, magis hanc dicendi rationem, quam cum imperatoris laude aut cum boni senatoris prudentia comparandam putet, convertat animum ad ea ipsa artium genera circumspiciatque, qui in eis floruerint quamque multi sint; sic facillime, quanta oratorum sit et semper fuerit paucitas, iudicabit."""
In [3]: test_corpus = [test_1, test_2]
In [4]: from cltk.stop.latin import CorpusStoplist
In [5]: S = CorpusStoplist()
In [6]: print(S.build_stoplist(test_corpus, size=10))
Out[6]: ['ac', 'atque', 'cum', 'et', 'in', 'mihi', 'neque', 'qui', 'rerum', 'vel']
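build_stoplist also accepts a basis parameter selecting how candidate stopwords are ranked; the multilingual stoplist documentation lists values such as 'frequency', 'mean', 'variance', 'entropy', and 'zou'. Treat the exact option names as assumptions to verify against your CLTK version (output omitted, since the ranking differs by basis):
In [7]: print(S.build_stoplist(test_corpus, size=10, basis='zou'))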
Stopword Filtering¶
To use a pre-built stoplist (created originally by the Perseus Project):
In [1]: from nltk.tokenize.punkt import PunktLanguageVars
In [2]: from cltk.stop.latin.stops import STOPS_LIST
In [3]: sentence = 'Quo usque tandem abutere, Catilina, patientia nostra?'
In [4]: p = PunktLanguageVars()
In [5]: tokens = p.word_tokenize(sentence.lower())
In [6]: [w for w in tokens if w not in STOPS_LIST]
Out[6]:
['usque',
'tandem',
'abutere',
',',
'catilina',
',',
'patientia',
'nostra',
'?']
Swadesh¶
The corpus module has a class for generating a Swadesh list for Latin.
In [1]: from cltk.corpus.swadesh import Swadesh
In [2]: swadesh = Swadesh('la')
In [3]: swadesh.words()[:10]
Out[3]: ['ego', 'tū', 'is, ea, id', 'nōs', 'vōs', 'eī, iī, eae, ea', 'hic, haec, hoc', 'ille, illa, illud', 'hīc', 'illic, ibi']
Syllabifier¶
The syllabifier splits a given input Latin word into a list of syllables based on an algorithm and set of syllable specifications for Latin.
In [1]: from cltk.stem.latin.syllabifier import Syllabifier
In [2]: word = 'sidere'
In [3]: syllabifier = Syllabifier()
In [4]: syllabifier.syllabify(word)
Out[4]: ['si', 'de', 're']
Text Cleanup¶
Intended for use on PHI5 texts after processing by TLGU().
In [1]: from cltk.corpus.utils.formatter import phi5_plaintext_cleanup
In [2]: import os
In [3]: file = os.path.expanduser('~/cltk_data/latin/text/phi5/individual_works/LAT0031.TXT-001.txt')
In [4]: with open(file) as f:
...: r = f.read()
...:
In [5]: r[:500]
Out[5]: '\nDices pulchrum esse inimicos \nulcisci. id neque maius neque pulchrius cuiquam atque mihi esse uide-\ntur, sed si liceat re publica salua ea persequi. sed quatenus id fieri non \npotest, multo tempore multisque partibus inimici nostri non peribunt \natque, uti nunc sunt, erunt potius quam res publica profligetur atque \npereat. \n Verbis conceptis deierare ausim, praeterquam qui \nTiberium Gracchum necarunt, neminem inimicum tantum molestiae \ntantumque laboris, quantum te ob has res, mihi tradidis'
In [6]: phi5_plaintext_cleanup(r, rm_punctuation=True, rm_periods=False)[:500]
Out[6]: ' Dices pulchrum esse inimicos ulcisci. id neque maius neque pulchrius cuiquam atque mihi esse uidetur sed si liceat re publica salua ea persequi. sed quatenus id fieri non potest multo tempore multisque partibus inimici nostri non peribunt atque uti nunc sunt erunt potius quam res publica profligetur atque pereat. Verbis conceptis deierare ausim praeterquam qui Tiberium Gracchum necarunt neminem inimicum tantum molestiae tantumque laboris quantum te ob has res mihi tradidisse quem oportebat omni'
If you have a text of a language in Latin characters which contains a lot of junk, remove_non_ascii() and remove_non_latin() might be of use.
In [1]: from cltk.corpus.utils.formatter import remove_non_ascii
In [2]: text = 'Dices ἐστιν ἐμός pulchrum esse inimicos ulcisci.'
In [3]: remove_non_ascii(text)
Out[3]: 'Dices pulchrum esse inimicos ulcisci.'
In [4]: from cltk.corpus.utils.formatter import remove_non_latin
In [5]: remove_non_latin(text)
Out[5]: ' Dices pulchrum esse inimicos ulcisci'
In [6]: remove_non_latin(text, also_keep=['.', ','])
Out[6]: ' Dices pulchrum esse inimicos ulcisci.'
Transliteration¶
The CLTK provides IPA phonetic transliteration for the Latin language. Currently, the only available dialect is Classical as reconstructed by W. Sidney Allen (taken from Vox Latina, 85-103). Example:
In [1]: from cltk.phonology.latin.transcription import Transcriber
In [2]: transcriber = Transcriber(dialect="Classical", reconstruction="Allen")
In [3]: transcriber.transcribe("Quo usque tandem, O Catilina, abutere nostra patientia?")
Out[3]: "['kʷoː 'ʊs.kʷɛ 't̪an̪.d̪ẽː 'oː ka.t̪ɪ.'liː.n̪aː a.buː.'t̪eː.rɛ 'n̪ɔs.t̪raː pa.t̪ɪ̣.'jɛn̪.t̪ɪ̣.ja]"
Word Tokenization¶
In [1]: from cltk.tokenize.word import WordTokenizer
In [2]: word_tokenizer = WordTokenizer('latin')
In [3]: text = 'atque haec abuterque puerve paterne nihil'
In [4]: word_tokenizer.tokenize(text)
Out[4]: ['atque', 'haec', 'abuter', '-que', 'puer', '-ve', 'pater', '-ne', 'nihil']
Word2Vec¶
Note
The Word2Vec models have not been fully vetted and are offered in the spirit of a beta. The CLTK’s API for it will be revised.
Note
You will need to install Gensim to use these features.
Word2Vec is a vector space model especially powerful for comparing words in relation to each other. For instance, it is commonly used to discover words which appear in similar contexts (something akin to synonyms; think of them as lexical clusters).
The CLTK repository contains pre-trained Word2Vec models for Latin (import as latin_word2vec_cltk), one lemmatized and the other not. They were trained on the PHI5 corpus. To train your own, see the README at the Latin Word2Vec repository.
One of the most common uses of Word2Vec is as a keyword expander. Keyword expansion is the taking of a query term, finding synonyms, and searching for those, too. Here’s an example of its use:
In [1]: from cltk.ir.query import search_corpus
In [2]: for x in search_corpus('amicitia', 'phi5', context='sentence', case_insensitive=True, expand_keyword=True, threshold=0.25):
   ...:     print(x)
   ...:
Out[2]: The following similar terms will be added to the 'amicitia' query: '['societate', 'praesentia', 'uita', 'sententia', 'promptu', 'beneuolentia', 'dignitate', 'monumentis', 'somnis', 'philosophia']'.
('L. Iunius Moderatus Columella', 'hospitem, nisi ex *amicitia* domini, quam raris-\nsime recipiat.')
('L. Iunius Moderatus Columella', ' \n Xenophon Atheniensis eo libro, Publi Siluine, qui Oeconomicus \ninscribitur, prodidit maritale coniugium sic comparatum esse \nnatura, ut non solum iucundissima, uerum etiam utilissima uitae \nsocietas iniretur: nam primum, quod etiam Cicero ait, ne genus \nhumanum temporis longinquitate occideret, propter \nhoc marem cum femina esse coniunctum, deinde, ut ex \nhac eadem *societate* mortalibus adiutoria senectutis nec \nminus propugnacula praeparentur.')
('L. Iunius Moderatus Columella', 'ac ne ista quidem \npraesidia, ut diximus, non adsiduus labor et experientia \nuilici, non facultates ac uoluntas inpendendi tantum pollent \nquantum uel una *praesentia* domini, quae nisi frequens \noperibus interuenerit, ut in exercitu, cum abest imperator, \ncuncta cessant officia.')
['…']
threshold is the closeness of the query term to its neighboring words. Note that when expand_keyword=True, the search term will be stripped of any regular expression syntax.
The keyword expander leverages get_sims() (which in turn leverages functionality of the Gensim package) to find similar terms. Some examples of it in action:
In [3]: from cltk.vector.word2vec import get_sims
In [4]: get_sims('iubeo', 'latin', lemmatized=True, threshold=0.7)
Matches found, but below the threshold of 'threshold=0.7'. Lower it to see these results.
Out[4]: []
In [5]: get_sims('iubeo', 'latin', lemmatized=True, threshold=0.2)
Out[5]:
['lictor',
'extemplo',
'cena',
'nuntio',
'aduenio',
'iniussus2',
'forum',
'dictator',
'fabium',
'caesarem']
In [6]: get_sims('iube', 'latin', lemmatized=True, threshold=0.7)
Out[6]: "word 'iube' not in vocabulary"
['The following terms in the Word2Vec model you may be looking for: '['iubet”', 'iubet', 'iubilo', 'iubĕ', 'iubar', 'iubes', 'iubatus', 'iuba1', 'iubeo']'.]'
In [7]: get_sims('dictator', 'latin', lemmatized=False, threshold=0.7)
Out[7]:
['consul',
'caesar',
'seruilius',
'praefectus',
'flaccus',
'manlius',
'sp',
'fuluius',
'fabio',
'ualerius']
To add and subtract vectors, you need to load the models yourself with Gensim.
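A minimal Gensim sketch of such vector arithmetic; the model filename below is a guess at the layout of the latin_word2vec_cltk corpus, so adjust the path to wherever the model actually lives on your machine:
In [8]: from gensim.models import Word2Vec
In [9]: import os
In [10]: path = os.path.expanduser('~/cltk_data/latin/model/latin_word2vec_cltk/latin_s100_w30_min5_sg.model')  # hypothetical path
In [11]: model = Word2Vec.load(path)
In [12]: model.wv.most_similar(positive=['consul', 'caesar'], negative=['dictator'], topn=5)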