8.1.13. cltk.readers package

8.1.13.1. Submodules

8.1.13.2. cltk.readers.latin_library_corpus_types module

latin_library_corpus_types - a mapping of corpus types into common periods, based largely on https://en.wikipedia.org/wiki/Latin_literature, with some personal choices: e.g., the inscrutable Twelve Tables is placed in an ‘early’ Latin classification, while Plautus and Terence are in the Old Latin section; some uncertain items are binned into ‘misc’. Pull requests to further sort this out are welcome!

8.1.13.3. cltk.readers.perseus_corpus_types module

perseus_corpus_types - a mapping of corpus types into common periods, based largely on: https://en.wikipedia.org/wiki/Latin_literature

8.1.13.4. cltk.readers.phi5_index module

Indices to the PHI5 Latin corpus.

8.1.13.5. cltk.readers.readers module

reader.py - Corpus reader utility objects.

cltk.readers.readers.get_corpus_reader(corpus_name=None, language=None)[source]

Corpus reader factory method.

Parameters

corpus_name (Optional[str]) – the name of the supported corpus, available as: [package].SUPPORTED_CORPORA

language – the language to search in

Return type

CorpusReader

Returns

an NLTK-compatible corpus reader
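The factory pattern here can be sketched as a registry lookup keyed by language and corpus name. The registry contents and the string return value below are illustrative stand-ins, not the actual SUPPORTED_CORPORA values or the real CLTK implementation, which constructs an NLTK CorpusReader:

```python
from typing import Optional

# Hypothetical registry standing in for [package].SUPPORTED_CORPORA;
# real corpus names depend on the installed CLTK corpora.
SUPPORTED_CORPORA = {
    "latin": ["latin_text_latin_library", "latin_text_perseus"],
    "greek": ["greek_text_perseus"],
}

def get_corpus_reader(corpus_name: Optional[str] = None,
                      language: Optional[str] = None) -> str:
    """Validate the request and return a reader identifier (sketch only)."""
    if language not in SUPPORTED_CORPORA:
        raise ValueError(f"unsupported language: {language}")
    if corpus_name not in SUPPORTED_CORPORA[language]:
        raise ValueError(f"unsupported corpus: {corpus_name}")
    # A real implementation would construct and return an NLTK
    # CorpusReader pointed at the corpus root on disk.
    return f"{language}/{corpus_name}"
```

Centralizing construction this way lets callers ask for a corpus by name without knowing which reader class backs it.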

cltk.readers.readers.assemble_corpus(corpus_reader, types_requested, type_dirs=None, type_files=None)[source]

Create a filtered corpus.

Parameters

corpus_reader (CorpusReader) – the reader to be filtered; this argument gets mutated

types_requested (List[str]) – a list of string types, which are to be found in the type_dirs and type_files mappings

type_dirs (Optional[Dict[str, List[str]]]) – a dict of corpus types to directories

type_files (Optional[Dict[str, List[str]]]) – a dict of corpus types to files

Return type

CorpusReader

Returns

a CorpusReader object containing only the mappings desired
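The selection step can be illustrated without NLTK: given the type mappings, collect the directory patterns and files for each requested type. The helper and type names below are a self-contained sketch, not the CLTK implementation:

```python
from typing import Dict, List, Optional

def select_fileids(types_requested: List[str],
                   type_dirs: Optional[Dict[str, List[str]]] = None,
                   type_files: Optional[Dict[str, List[str]]] = None) -> List[str]:
    """Collect fileid patterns for the requested corpus types."""
    selected: List[str] = []
    for corpus_type in types_requested:
        if type_dirs and corpus_type in type_dirs:
            # Directories become glob-style fileid patterns.
            selected += [f"{d}/.*" for d in type_dirs[corpus_type]]
        if type_files and corpus_type in type_files:
            selected += type_files[corpus_type]
    return selected

# A real assemble_corpus would then narrow the reader's fileids
# to this filtered list, mutating the reader in place.
```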

class cltk.readers.readers.FilteredPlaintextCorpusReader(root, fileids=None, encoding='utf8', skip_keywords=None, **kwargs)[source]

Bases: nltk.corpus.reader.plaintext.PlaintextCorpusReader, nltk.corpus.reader.api.CorpusReader

A corpus reader for plain text documents with simple filtration for streamlined pipeline use. A list of keywords may be provided, and if any of these keywords are found in a document’s paragraph, that whole paragraph will be skipped; the same applies to sentences and words.
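The skip_keywords rule can be sketched as a plain generator over paragraphs; this is an illustration of the filtering behavior described above, not the reader's actual code:

```python
from typing import Iterable, Iterator, List

def filter_paras(paras: Iterable[str], skip_keywords: List[str]) -> Iterator[str]:
    """Yield only paragraphs that contain none of the skip keywords."""
    for para in paras:
        if not any(keyword in para for keyword in skip_keywords):
            yield para
```

The same predicate applies per sentence or per word when those units are requested instead.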

words(fileids=None)[source]

Provide the words of the corpus, skipping any paragraphs flagged by keywords passed to the class constructor.

Parameters

fileids

Return type

Generator[str, str, None]

Returns

words, including punctuation, one by one

paras(fileids=None)[source]

Provide paragraphs, if possible.

Parameters

fileids

Return type

Generator[str, str, None]

Returns

a generator of paragraphs

sents(fileids=None)[source]

A generator for sentences in a text, or texts.

Parameters

fileids

Return type

Generator[str, str, None]

Returns

a generator of sentences

docs(fileids=None)[source]

Returns the complete text of a text document, closing the document after we are done reading it and yielding it in a memory-safe fashion.

Return type

Generator[str, str, None]

sizes(fileids=None)[source]

Yields, for each file, a tuple of the fileid and its size on disk. This function is used to detect oddly large files in the corpus.

Return type

Generator[int, int, None]

class cltk.readers.readers.JsonfileCorpusReader(root, fileids=None, encoding='utf8', skip_keywords=None, target_language=None, paragraph_separator='\n\n', **kwargs)[source]

Bases: nltk.corpus.reader.api.CorpusReader

A corpus reader for JSON documents whose contents are stored in a dictionary. Supports any document stored under a text key. A document may have any number of subsections as nested dictionaries, as long as their keys are sortable; they will be traversed, and only string datatypes will be collected as the text. E.g.:

doc['text']['1'] = "some text"
doc['text']['2'] = "more text"

Or with one level of subsections:

doc['text']['1']['1'] = "some text"
doc['text']['1']['2'] = "more text"
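The traversal described above, visiting sibling sections in sorted key order and collecting only string values, can be sketched recursively; this helper is illustrative, not the reader's own code:

```python
from typing import Any, List

def collect_text(section: Any) -> List[str]:
    """Depth-first traversal collecting only string values, with
    sibling sections visited in sorted key order."""
    if isinstance(section, str):
        return [section]
    if isinstance(section, dict):
        chunks: List[str] = []
        for key in sorted(section):
            chunks.extend(collect_text(section[key]))
        return chunks
    return []  # non-string, non-dict values are ignored

doc = {"text": {"1": {"1": "some text", "2": "more text"}, "2": "end"}}
```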

words(fileids=None)[source]

Provide the words of the corpus, skipping any paragraphs flagged by keywords passed to the class constructor.

Parameters

fileids

Return type

Generator[str, str, None]

Returns

words, including punctuation, one by one

sents(fileids=None)[source]
Parameters

fileids

Return type

Generator[str, str, None]

Returns

A generator of sentences

paras(fileids=None)[source]

Yield paragraphs of the text, as demarcated by double newlines.

Parameters

fileids – a single document file, or files, of proper JSON objects with a text key and section subkeys

Return type

Generator[str, str, None]

Returns

a generator of paragraphs

docs(fileids=None)[source]

Returns the complete text of a text document, closing the document after we are done reading it and yielding it in a memory-safe fashion.

Return type

Generator[Dict[str, Any], Dict[str, Any], None]

Returns

a Python dictionary of strings or nested dictionaries; the top-level dictionary also contains the filename from which it was spawned

sizes(fileids=None)[source]

Yields, for each file, a tuple of the fileid and its size on disk. This function is used to detect oddly large files in the corpus.

Return type

Generator[int, int, None]

class cltk.readers.readers.TesseraeCorpusReader(root, fileids=None, encoding='utf8', skip_keywords=None, **kwargs)[source]

Bases: nltk.corpus.reader.plaintext.PlaintextCorpusReader

docs(fileids)[source]

Returns the complete text of a .tess file, closing the document after we are done reading it and yielding it in a memory-safe fashion.

texts(fileids, plaintext=True)[source]

Returns the text content of a .tess file, with the bracketed citation info removed (e.g. “<Ach. Tat. 1.1.0>”)
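Stripping the bracketed citation prefix can be done with a regular expression. The pattern below is a plausible sketch for lines like “<Ach. Tat. 1.1.0> …”, not the reader's exact rule:

```python
import re

def strip_citation(line: str) -> str:
    """Remove a leading angle-bracketed citation tag, plus any
    whitespace after it, from one line of a .tess file."""
    return re.sub(r"^<[^>]*>\s*", "", line)
```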

paras(fileids)[source]

Returns paragraphs in a .tess file, as defined by two newline characters.

NB: Most .tess files do not have this feature; only the Homeric poems, from what I have noticed so far. Perhaps a feature worth looking into.

lines(fileids, plaintext=True)[source]

Tokenizes documents in the corpus by line

sents(fileids)[source]

Tokenizes documents in the corpus by sentence

words(fileids)[source]

Tokenizes documents in the corpus by word

pos_tokenize(fileids)[source]

Segments, tokenizes, and POS-tags a document in the corpus.

describe(fileids=None)[source]

Performs a single pass of the corpus and returns a dictionary with a variety of metrics concerning the state of the corpus.

Based on Bengfort et al. (2018: 46).
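A single-pass describe can be sketched as one loop accumulating simple counts; the metric names below are illustrative, following the corpus-description pattern in Bengfort et al. (2018), and are not necessarily the keys the CLTK method returns:

```python
from collections import Counter
from typing import Dict, Iterable

def describe(docs: Iterable[str]) -> Dict[str, int]:
    """One pass over the documents, accumulating corpus metrics:
    file count, token count, and vocabulary size."""
    counts: Counter = Counter()
    vocab = set()
    for doc in docs:
        counts["files"] += 1
        words = doc.split()  # naive whitespace tokenization for the sketch
        counts["words"] += len(words)
        vocab.update(words)
    return {"files": counts["files"],
            "words": counts["words"],
            "vocab": len(vocab)}
```

Doing everything in one pass matters for large corpora, since the documents are streamed from disk rather than held in memory.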