8.1.13. cltk.readers package
8.1.13.1. Submodules
8.1.13.2. cltk.readers.latin_library_corpus_types module
latin_library_corpus_types - a mapping of corpus types into common periods, based largely on https://en.wikipedia.org/wiki/Latin_literature and some personal choices. For example, the inscrutable Twelve Tables is placed in an ‘early’ Latin classification, while Plautus and Terence are in the Old Latin section; some uncertain items are binned into ‘misc’. Pull requests to further sort this out are welcome!
8.1.13.3. cltk.readers.perseus_corpus_types module
perseus_corpus_types - a mapping of corpus types into common periods, based largely on https://en.wikipedia.org/wiki/Latin_literature
8.1.13.4. cltk.readers.phi5_index module
Indices to the PHI5 Latin corpus.
8.1.13.5. cltk.readers.readers module
readers.py - Corpus reader utility objects.
cltk.readers.readers.get_corpus_reader(corpus_name=None, language=None)
Corpus reader factory method.
- Parameters
corpus_name (Optional[str]) – the name of the supported corpus, available as: [package].SUPPORTED_CORPORA
language – the language to search in
- Return type
CorpusReader
- Returns
NLTK-compatible corpus reader
cltk.readers.readers.assemble_corpus(corpus_reader, types_requested, type_dirs=None, type_files=None)
Create a filtered corpus.
- Parameters
corpus_reader (CorpusReader) – the reader to filter; this object gets mutated
types_requested (List[str]) – a list of string types, which are to be found in the type_dirs and type_files mappings
type_dirs (Optional[Dict[str, List[str]]]) – a dict of corpus types to directories
type_files (Optional[Dict[str, List[str]]]) – a dict of corpus types to files
- Return type
CorpusReader
- Returns
a CorpusReader object containing only the mappings desired
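The selection step that assemble_corpus describes can be sketched without CLTK or NLTK. The helper below is hypothetical (select_fileids is not part of the library), and it assumes that a directory entry matches every file id under that path and that a file entry matches a file id exactly; the real function may match differently. The file names are invented for illustration.

```python
from typing import Dict, List, Optional


def select_fileids(
    fileids: List[str],
    types_requested: List[str],
    type_dirs: Optional[Dict[str, List[str]]] = None,
    type_files: Optional[Dict[str, List[str]]] = None,
) -> List[str]:
    """Hypothetical helper: keep file ids that belong to the requested types."""
    selected = set()
    for corpus_type in types_requested:
        # Assumption: a directory entry matches every file id under that path.
        for directory in (type_dirs or {}).get(corpus_type, []):
            prefix = directory.rstrip("/") + "/"
            selected.update(f for f in fileids if f.startswith(prefix))
        # Assumption: a file entry matches a file id exactly.
        for filename in (type_files or {}).get(corpus_type, []):
            selected.update(f for f in fileids if f == filename)
    # Preserve the original ordering of the corpus.
    return [f for f in fileids if f in selected]


# Invented file ids and type mappings, for illustration only.
fileids = ["cicero/cat1.txt", "cicero/cat2.txt", "12tables.txt", "plautus/amphitruo.txt"]
type_dirs = {"republican": ["cicero"], "old": ["plautus"]}
type_files = {"early": ["12tables.txt"]}

print(select_fileids(fileids, ["early", "old"], type_dirs, type_files))
```

Unlike this sketch, which returns a new list, the documented function mutates the reader it is given and returns it.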
class cltk.readers.readers.FilteredPlaintextCorpusReader(root, fileids=None, encoding='utf8', skip_keywords=None, **kwargs)
Bases: nltk.corpus.reader.plaintext.PlaintextCorpusReader, nltk.corpus.reader.api.CorpusReader
A corpus reader for plain text documents with simple filtration for streamlined pipeline use. A list of keywords may be provided; if any of these keywords is found in a document’s paragraph, that whole paragraph is skipped. The same applies to sentences and words.
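The keyword filtering described for FilteredPlaintextCorpusReader can be illustrated with a minimal standard-library sketch. The function filtered_paras is hypothetical, and it assumes that paragraphs are separated by blank lines and that keywords are matched as case-sensitive substrings; the class itself may tokenize and match differently.

```python
from typing import Iterator, List, Optional


def filtered_paras(text: str, skip_keywords: Optional[List[str]] = None) -> Iterator[str]:
    """Yield paragraphs, skipping any that contain one of the keywords."""
    # Assumption: a blank line separates paragraphs.
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Assumption: keywords are matched as case-sensitive substrings.
        if skip_keywords and any(keyword in para for keyword in skip_keywords):
            continue
        yield para


sample = (
    "Gallia est omnis divisa in partes tres.\n\n"
    "The Latin Library\n\n"
    "Arma virumque cano."
)
kept = list(filtered_paras(sample, skip_keywords=["Latin Library"]))
print(kept)
```

A generator is used here so that, as with the reader's own methods, paragraphs are produced one at a time rather than loaded into a list up front.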
FilteredPlaintextCorpusReader.words(fileids=None)
Provide the words of the corpus, skipping any paragraphs flagged by the keywords passed to the class constructor.
- Parameters
fileids –
- Return type
Generator[str, str, None]
- Returns
words, including punctuation, one by one
FilteredPlaintextCorpusReader.paras(fileids=None)
Provide paragraphs, if possible.
- Parameters
fileids –
- Return type
Generator[str, str, None]
- Returns
a generator of paragraphs
FilteredPlaintextCorpusReader.sents(fileids=None)
A generator for sentences in a text or texts.
- Parameters
fileids –
- Return type
Generator[str, str, None]
- Returns
a generator of sentences
class cltk.readers.readers.JsonfileCorpusReader(root, fileids=None, encoding='utf8', skip_keywords=None, target_language=None, paragraph_separator='\n\n', **kwargs)
Bases: nltk.corpus.reader.api.CorpusReader
A corpus reader for JSON documents whose contents are stored in a dictionary. Supports any document stored under a text key. A document may have any number of subsections as nested dictionaries, as long as their keys are sortable; they are traversed, and only string values are collected as the text. E.g.:
doc['text']['1'] = "some text"
doc['text']['2'] = "more text"
Or with one level of subsections:
doc['text']['1']['1'] = "some text"
doc['text']['1']['2'] = "more text"
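The traversal described above (nested dictionaries visited in sorted key order, with only strings collected) can be sketched as follows. The function collect_text is hypothetical and only approximates the behaviour stated in the description; the sample document is invented.

```python
import json
from typing import Any, Iterator


def collect_text(node: Any) -> Iterator[str]:
    """Walk nested dictionaries in sorted key order, yielding only strings."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for key in sorted(node):
            yield from collect_text(node[key])
    # Any other datatype (numbers, lists, None) is ignored in this sketch.


# Invented document: a "text" key with one level of subsections.
doc = json.loads(
    '{"text": {"2": {"1": "more text"}, "1": {"2": "b", "1": "a"}}, "author": "x"}'
)
sections = list(collect_text(doc["text"]))
print(sections)
```

Note that Python sorts string keys lexicographically, so in this sketch a section keyed "10" would come before one keyed "2". Whether the library orders such keys differently is not stated here.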
JsonfileCorpusReader.words(fileids=None)
Provide the words of the corpus, skipping any paragraphs flagged by the keywords passed to the class constructor.
- Parameters
fileids –
- Return type
Generator[str, str, None]
- Returns
words, including punctuation, one by one
JsonfileCorpusReader.sents(fileids=None)
- Parameters
fileids –
- Return type
Generator[str, str, None]
- Returns
a generator of sentences
JsonfileCorpusReader.paras(fileids=None)
Yield paragraphs of the text, as demarcated by double new lines.
- Parameters
fileids – single document file or files of proper JSON objects with a text key, and section subkey
- Return type
Generator[str, str, None]
- Returns
a generator of paragraphs
JsonfileCorpusReader.docs(fileids=None)
Returns the complete text of a text document, closing the document after we are done reading it and yielding it in a memory-safe fashion.
- Return type
Generator[Dict[str, Any], Dict[str, Any], None]
- Returns
Python dictionary of strings or nested dictionaries. The top-level dictionary also contains the name of the file it was read from.
class cltk.readers.readers.TesseraeCorpusReader(root, fileids=None, encoding='utf8', skip_keywords=None, **kwargs)
Bases: nltk.corpus.reader.plaintext.PlaintextCorpusReader
TesseraeCorpusReader.docs(fileids)
Returns the complete text of a .tess file, closing the document after we are done reading it and yielding it in a memory-safe fashion.
TesseraeCorpusReader.texts(fileids, plaintext=True)
Returns the text content of a .tess file, i.e. removing the bracketed citation info (e.g. “<Ach. Tat. 1.1.0>”).
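Removing the bracketed citation info from .tess content can be sketched with a regular expression. The function strip_citations is hypothetical, and it assumes each line begins with an angle-bracketed citation followed by whitespace; the reader's own implementation may differ. The sample text is a placeholder, not real corpus content.

```python
import re

# Assumption: the citation sits at the start of a line, inside angle brackets.
CITATION = re.compile(r"^<[^>]*>\s*")


def strip_citations(tess_text: str) -> str:
    """Remove a leading <...> citation from every line and drop empty lines."""
    lines = (CITATION.sub("", line) for line in tess_text.splitlines())
    return "\n".join(line for line in lines if line)


# Placeholder lines in the citation style quoted above.
sample = "<Ach. Tat. 1.1.0>\tfirst line\n<Ach. Tat. 1.1.1>\tsecond line"
plain = strip_citations(sample)
print(plain)
```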