2. Quickstart

2.1. Installation

Install via Pip:

$ pip install cltk

2.2. Use

cltk.nlp.NLP() has pre-configured processing pipelines for a number of Languages. Executing cltk.nlp.NLP.analyze() returns a cltk.core.data_types.Doc object, which contains all processed information.

To process text:

>>> from cltk import NLP
>>> vitruvius = "Architecti est scientia pluribus disciplinis et variis eruditionibus ornata, quae ab ceteris artibus perficiuntur. Opera ea nascitur et fabrica et ratiocinatione."
>>> cltk_nlp = NLP(language="lat")
‎𐤀 CLTK version 'cltk 1.0.0b10'.
Pipeline for language 'Latin' (ISO: 'lat'): `LatinNormalizeProcess`, `LatinStanzaProcess`, `LatinEmbeddingsProcess`, `StopsProcess`, `LatinNERProcess`, `LatinLexiconProcess`.
>>> cltk_doc = cltk_nlp.analyze(text=vitruvius)

Some NLP Process require downloaded models, for which you will be prompted to download. You may then inspect the output Doc, which contains the information produced by each Process step:

>>> cltk_doc.tokens[:5]
['Architecti', 'est', 'scientia', 'pluribus', 'disciplinis']
>>> cltk_doc.lemmata[:5]
['mrchiteo', 'sum', 'scientia', 'multus', 'disciplina']
>>> cltk_doc.morphosyntactic_features[2]  # 'scientia'
{Case: [nominative], Degree: [positive], Gender: [feminine], Number: [singular]}
>>> cltk_doc.pos[:5]
['VERB', 'AUX', 'NOUN', 'ADJ', 'NOUN']
>>> cltk_doc.sentences_tokens
[['Architecti', 'est', 'scientia', 'pluribus', 'disciplinis', ...], ...]

Most processes add their information to a list of Word objects at Doc.words:

>>> cltk_doc.words[1].string
'est'
>>> cltk_doc.words[1].stop
True
>>> cltk_doc.words[1].lemma
'sum'
>>> cltk_doc.words[4].definition[:200]
‘disciplīnannn ae, nfnndiscipulus, ninstruction, tuition, teaching, training, educationn: puerilis: adulescentīs in disciplinam ei tradere:n te in disciplinam meam tradere: in disciplina’
>>> cltk_doc.words[4].pos
'NOUN'
>>> cltk_doc.words[4].category
{F: [neg], N: [pos], V: [neg]}
>>> cltk_doc.words[4].features
{Case: [ablative], Degree: [positive], Gender: [feminine], Number: [plural]}
>>> cltk_doc.words[4].dependency_relation
'obl'
>>> cltk_doc.words[4].governor  # this word's "parent"
8
>>> cltk_doc.words[8].string  # looking at this word
'ornata'
>>> cltk_doc.words[4].embedding[:5]
array([-0.10924 , -0.048127,  0.15953 , -0.19465 ,  0.17935 ],
      dtype=float32)
>>> cltk_doc.words[2].embedding[:5]  # 'scientia'
array([-0.28462 ,  0.64238 , -0.40037 ,  0.39382 ,  0.060418],
      dtype=float32)
>>> cltk_doc.words[5].index_sentence  # sentence to which a token belongs
0
>>> cltk_doc.words[20].index_sentence
1

For more, see Pipelines, Processes, Docs, and Words.

2.3. Tutorials

Demonstration notebooks available at https://github.com/cltk/cltk/tree/dev/notebooks.