8.1.1. cltk.alphabet package

Modules for accessing the alphabets and character sets of in-scope CLTK languages.

8.1.1.2. Submodules

8.1.1.3. cltk.alphabet.ang module

The Old English alphabet.

>>> from cltk.alphabet import ang
>>> ang.DIGITS[:5]
['ān', 'tƿeġen', 'þrēo', 'fēoƿer', 'fīf']
>>> ang.DIPHTHONGS[:5]
['ea', 'eo', 'ie']

8.1.1.4. cltk.alphabet.arb module

The Arabic alphabet.

>>> from cltk.alphabet import arb
>>> arb.LETTERS[:5]
('ا', 'ب', 'ت', 'ة', 'ث')
>>> arb.PUNCTUATION_MARKS
['،', '؛', '؟']
>>> arb.ALEF
'ا'
>>> arb.WEAK
('ا', 'و', 'ي', 'ى')

8.1.1.5. cltk.alphabet.arc module

The Imperial Aramaic alphabet, plus a simple script to transform a Hebrew transcription of an Imperial Aramaic text into its own Unicode block.

TODO: Add Hebrew-to-Aramaic converter

cltk.alphabet.arc.square_to_imperial(square_script)[source]

Transform a Hebrew transcription of an Imperial Aramaic text into its own Unicode block.

Return type

str
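The conversion described above amounts to a character-for-character translation table. The sketch below is a hypothetical illustration covering only two letters, not the module's actual table; it maps Hebrew square-script letters to the Imperial Aramaic Unicode block (U+10840–U+1085F).

```python
# Hypothetical sketch of a square-script-to-Imperial-Aramaic converter.
# Only two letters are mapped here for illustration.
SQUARE_TO_IMPERIAL = {
    "\u05D0": "\U00010840",  # Hebrew alef  -> Imperial Aramaic aleph
    "\u05D1": "\U00010841",  # Hebrew bet   -> Imperial Aramaic beth
}

def square_to_imperial_sketch(square_script: str) -> str:
    """Replace each square-script letter with its Imperial Aramaic form."""
    return "".join(SQUARE_TO_IMPERIAL.get(ch, ch) for ch in square_script)
```

Characters outside the table (spaces, punctuation) pass through unchanged.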

8.1.1.6. cltk.alphabet.ben module

The Bengali alphabet.

>>> from cltk.alphabet import ben
>>> ben.VOWELS[:5]
['অ', 'আ', 'ই', 'ঈ', 'উ']
>>> ben.DEPENDENT_VOWELS[:5]
['◌া', 'ি', '◌ী', '◌ু', '◌ূ']
>>> ben.CONSONANTS[:5]
['ক', 'খ', 'গ', 'ঘ ', 'ঙ']

8.1.1.7. cltk.alphabet.egy module

Convert MdC transliterated text to Unicode.

cltk.alphabet.egy.mdc_unicode(string, q_kopf=True)[source]

Convert transliterated text to Unicode. The transliterated text is passed to the function as the string argument, and the relevant characters are searched and replaced. If the q_kopf parameter is False, 'q' is replaced with 'ḳ'.

Parameters
  • string (str) – the transliterated text to convert

  • q_kopf (bool) – if False, replace 'q' with 'ḳ'

Return type

str
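To make the behavior concrete, here is a toy version with only two replacement rules; the actual mdc_unicode covers the full MdC transliteration alphabet.

```python
# Toy sketch of MdC-to-Unicode conversion with a tiny rule set;
# the real function handles the complete MdC character inventory.
def mdc_unicode_sketch(string, q_kopf=True):
    replacements = {"A": "\ua723"}  # MdC aleph -> ꜣ
    if not q_kopf:
        replacements["q"] = "\u1e33"  # q -> ḳ
    for src, dst in replacements.items():
        string = string.replace(src, dst)
    return string
```

For example, with q_kopf=False the input "qA" becomes "ḳꜣ".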

8.1.1.8. cltk.alphabet.enm module

The Middle English alphabet.

The consonant sounds of Middle English are categorized as follows:

  • Stops: ⟨/b/, /p/, /d/, /t/, /g/, /k/⟩

  • Fricatives and affricates: ⟨/ǰ/, /č/, /v/, /f/, /ð/, /θ/, /z/, /s/, /ž/, /š/, /c̹/, /x/, /h/⟩

  • Nasals: ⟨/m/, /n/, /ɳ/⟩

  • Lateral Resonants: ⟨/l/⟩

  • Medial Resonants: ⟨/r/, /y/, /w/⟩

Thorn (þ) was gradually replaced by the digraph “th”, while eth (ð), which had already fallen out of use by the 14th century, was later replaced by “d”.

Wynn (ƿ) is the predecessor of “w”. Modern transliterations usually replace it with “w” to avoid confusion with the strikingly similar letter “p”.

The vowel sounds in Middle English are divided into:

  • Long Vowels: ⟨/a:/, /e/, /e̜/, /i/, /ɔ:/, /o/, /u/⟩

  • Short Vowels: ⟨/a/, /ɛ/, /I/, /ɔ/, /U/, /ə/⟩

As established rules for Middle English orthography were effectively nonexistent, compiling a definitive list of diphthongs is non-trivial. The following aims to list the most commonly used diphthongs.

>>> from cltk.alphabet import enm
>>> enm.ALPHABET[:5]
['a', 'b', 'c', 'd', 'e']
>>> enm.CONSONANTS[:5]
['b', 'c', 'd', 'f', 'g']
cltk.alphabet.enm.normalize_middle_english(text, to_lower=True, alpha_conv=True, punct=True)[source]

Normalize a Middle English text string and return the normalized string.

Parameters
  • text (str) – text to be normalized

  • to_lower (bool) – convert text to lowercase

  • alpha_conv (bool) – convert the alphabet to canonical form: æ -> ae, þ -> th, ð -> th, ȝ -> y at the beginning of a word, gh otherwise

  • punct (bool) – remove punctuation

>>> normalize_middle_english('Whan Phebus in the CraBbe had neRe hys cours ronne', to_lower = True)
'whan phebus in the crabbe had nere hys cours ronne'
>>> normalize_middle_english('I pray ȝow þat ȝe woll', alpha_conv = True)
'i pray yow that ye woll'
>>> normalize_middle_english("furst, to begynne:...", punct = True)
'furst to begynne'
Return type

str
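The alpha_conv mapping described above can be sketched in a few lines. This is a hypothetical reimplementation for illustration, not the module's own code:

```python
import re

# Hypothetical sketch of the alpha_conv step: map Middle English
# letters to their canonical modern equivalents.
def to_canonical(word):
    word = word.replace("\u00e6", "ae")                          # æ -> ae
    word = word.replace("\u00fe", "th").replace("\u00f0", "th")  # þ, ð -> th
    word = re.sub(r"^\u021d", "y", word)                         # ȝ at start -> y
    return word.replace("\u021d", "gh")                          # ȝ elsewhere -> gh
```

For example, "ȝow" becomes "yow" and "riȝt" becomes "right", matching the doctest above.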

8.1.1.9. cltk.alphabet.fro module

The normalizer aims to maximally reduce the variation between the orthography of texts written in the Anglo-Norman dialect and bring it in line with the “orthographe commune”. It is heavily inspired by Pope (1956). Spelling variation is not consistent enough to ensure the highest accuracy; the normalizer in its current form should therefore be used as a last resort.

The normalizer, word tokenizer, stemmer, lemmatizer, and list of stopwords for Old and Middle French were developed as part of Google Summer of Code 2017. A full write-up of this work can be found at: https://gist.github.com/nat1881/6f134617805e2efbe5d275770e26d350

References: Pope, M. K. 1956. From Latin to Modern French with Especial Consideration of Anglo-Norman. Manchester: MUP.

Anglo-French spelling variants normalized to the “orthographe commune”, from Pope (1956):

  • word-final d - e.g. vertud vs vertu

  • use of <u> over <ou>

  • <eaus> for <eus>, <ceaus> for <ceus>

  • triphthongs:
    • <iu> for <ieu>

    • <u> for <eu>

    • <ie> for <iee>

    • <ue> for <uee>

    • <ure> for <eure>

  • “epenthetic vowels” - e.g. averai for avrai

  • <eo> for <o>

  • <iw>, <ew> for <ieux>

  • final <a> for <e>

cltk.alphabet.fro.build_match_and_apply_functions(pattern, replace)[source]

Assemble regex patterns.
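A minimal sketch of what such a regex-pair builder might look like; the implementation is illustrative, and the rule shown (dropping word-final d) is taken from the variant list above.

```python
import re

# Illustrative sketch: pair a "does this rule apply?" test with the
# function that applies the substitution.
def build_match_and_apply_sketch(pattern, replace):
    def matches_rule(word):
        return re.search(pattern, word) is not None
    def apply_rule(word):
        return re.sub(pattern, replace, word)
    return matches_rule, apply_rule

# Rule for the word-final d variant, e.g. vertud vs vertu.
matches_final_d, drop_final_d = build_match_and_apply_sketch(r"d$", "")
```

Applying the rule to "vertud" yields the normalized "vertu".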

cltk.alphabet.fro.normalize_fr(tokens)[source]

Normalize Old and Middle French tokens.

TODO: Make this work again with a tokenizer.

Return type

List[str]

8.1.1.10. cltk.alphabet.gmh module

The alphabet for Middle High German.

The consonants of Middle High German are categorized as:

  • Stops: ⟨p t k/c/q b d g⟩

  • Affricates: ⟨pf/ph tz/z⟩

  • Fricatives: ⟨v f s ȥ sch ch h⟩

  • Nasals: ⟨m n⟩

  • Liquids: ⟨l r⟩

  • Semivowels: ⟨w j⟩

Misc. notes:

  • c is used only at the beginning of loanwords and is pronounced the same as k (e.g. calant, cappitain)

  • Double consonants are pronounced the same way as their corresponding letters in Modern Standard German (e.g. pp/p)

  • Modern German schl, schm, schn, schw are written in MHG as sl, sm, sn, sw

  • æ (also seen as ae), œ (also seen as oe) and iu denote the use of Umlaut over â, ô and û respectively

  • ȥ or ʒ is used in modern handbooks and grammars to indicate the s or s-like sound which arose from Germanic t in the High German consonant shift.

>>> from cltk.alphabet import gmh
>>> gmh.CONSONANTS[:5]
['b', 'd', 'g', 'h', 'f']
>>> gmh.VOWELS[:5]
['a', 'ë', 'e', 'i', 'o']
cltk.alphabet.gmh.normalize_middle_high_german(text, to_lower_all=True, to_lower_beginning=False, alpha_conv=True, punct=True, ascii=False)[source]

Normalize input string.

>>> from cltk.alphabet import gmh
>>> from cltk.languages.example_texts import get_example_text
>>> gmh.normalize_middle_high_german(get_example_text("gmh"))[:50]
'uns ist in alten\nmæren wunders vil geseit\nvon hele'
Parameters
  • text (str) – text to normalize

  • to_lower_beginning (bool) –

  • to_lower_all (bool) – convert whole text to lowercase

  • alpha_conv (bool) – convert alphabet to canonical form

  • punct (bool) – remove punctuation

  • ascii (bool) – return the ASCII form

Returns

normalized text

8.1.1.11. cltk.alphabet.guj module

The Gujarati alphabet.

>>> from cltk.alphabet import guj
>>> guj.VOWELS[:5]
['અ', 'આ', 'ઇ', 'ઈ', 'ઉ']
>>> guj.CONSONANTS[:5]
['ક', 'ખ', 'ગ', 'ઘ', 'ચ']

8.1.1.12. cltk.alphabet.hin module

The Hindi alphabet.

>>> from cltk.alphabet import hin
>>> hin.VOWELS[:5]
['अ', 'आ', 'इ', 'ई', 'उ']
>>> hin.CONSONANTS[:5]
['क', 'ख', 'ग', 'घ', 'ङ']
>>> hin.SONORANT_CONSONANTS
['य', 'र', 'ल', 'व']

8.1.1.13. cltk.alphabet.kan module

The Kannada alphabet. The characters can be divided into 3 categories:

  1. Swaras (Vowels): 13 in modern Kannada and 14 in Classical

  2. Vyanjanas (Consonants), further divided into 2 categories:

    1. Structured Consonants: 25

    2. Unstructured Consonants: 9 in modern Kannada and 11 in Classical

  3. Yogavaahakas (part vowel, part consonant): 2

Corresponding to each Swara and Yogavaahaka there is a symbol; thus Consonant + Vowel Symbol = Kagunita.
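The Kagunita composition is simply the consonant code point followed by the dependent vowel sign. The example below shows standard Unicode behavior, not anything CLTK-specific:

```python
# Kagunita: consonant + dependent vowel sign forms one syllable glyph.
ka = "\u0c95"             # ಕ  KANNADA LETTER KA
vowel_sign_aa = "\u0cbe"  # ಾ  KANNADA VOWEL SIGN AA
kagunita = ka + vowel_sign_aa  # renders as the single syllable "kaa"
```

The result is a two-code-point string that a text renderer displays as one syllable.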

>>> from cltk.alphabet import kan
>>> kan.VOWELS[:5]
['ಅ', 'ಆ', 'ಇ', 'ಈ', 'ಉ']
>>> kan.STRUCTURED_CONSONANTS[:5]
['ಕ', 'ಖ', 'ಗ', 'ಘ', 'ಙ']

8.1.1.14. cltk.alphabet.lat module

Alphabet and text normalization for Latin.

cltk.alphabet.lat.normalize_lat(text)[source]

The function for all default Latin normalization.

TODO: Add parameters for stripping macrons, other unlikely chars. Perhaps use remove_non_ascii().

Return type

str
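The macron stripping mentioned in the TODO above could be sketched with Unicode decomposition. This is illustrative only, not what normalize_lat currently does:

```python
import unicodedata

# Illustrative macron stripping: decompose, drop the combining
# macron (U+0304), then recompose.
def strip_macrons(text):
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if ch != "\u0304")
    return unicodedata.normalize("NFC", stripped)
```

For example, "amō" becomes "amo" and "vīta" becomes "vita".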

8.1.1.15. cltk.alphabet.non module

Old Norse runes, Unicode block: 16A0–16FF. Source: Viking Language 1, Jessie L. Byock

TODO: Document and test better.

class cltk.alphabet.non.AutoName(value)[source]

Bases: enum.Enum

An enumeration.

class cltk.alphabet.non.RunicAlphabetName(value)[source]

Bases: cltk.alphabet.non.AutoName

An enumeration.

elder_futhark = 'elder_futhark'
younger_futhark = 'younger_futhark'
short_twig_younger_futhark = 'short_twig_younger_futhark'
class cltk.alphabet.non.Rune(runic_alphabet, form, sound, transcription, name)[source]

Bases: object

>>> Rune(RunicAlphabetName.elder_futhark, "ᚺ", "h", "h", "haglaz")

>>> Rune.display_runes(ELDER_FUTHARK)
['ᚠ', 'ᚢ', 'ᚦ', 'ᚨ', 'ᚱ', 'ᚲ', 'ᚷ', 'ᚹ', 'ᚺ', 'ᚾ', 'ᛁ', 'ᛃ', 'ᛇ', 'ᛈ', 'ᛉ', 'ᛊ', 'ᛏ', 'ᛒ', 'ᛖ', 'ᛗ', 'ᛚ', 'ᛜ', 'ᛟ', 'ᛞ']
static display_runes(runic_alphabet)[source]

Display the given runic alphabet.

Parameters
  • runic_alphabet (list) – the runic alphabet to display

Return type

list

static from_form_to_transcription(form, runic_alphabet)[source]
Parameters
  • form (str) –

  • runic_alphabet (list) –

Returns

conventional transcription of the rune

class cltk.alphabet.non.Transcriber[source]

Bases: object

>>> little_jelling_stone = "᛬ᚴᚢᚱᛘᛦ᛬ᚴᚢᚾᚢᚴᛦ᛬ᚴ(ᛅᚱ)ᚦᛁ᛬ᚴᚢᛒᛚ᛬ᚦᚢᛋᛁ᛬ᛅ(ᚠᛏ)᛬ᚦᚢᚱᚢᛁ᛬ᚴᚢᚾᚢ᛬ᛋᛁᚾᛅ᛬ᛏᛅᚾᛘᛅᚱᚴᛅᛦ᛬ᛒᚢᛏ᛬"
>>> Transcriber.transcribe(little_jelling_stone, YOUNGER_FUTHARK)
'᛫kurmR᛫kunukR᛫k(ar)þi᛫kubl᛫þusi᛫a(ft)᛫þurui᛫kunu᛫sina᛫tanmarkaR᛫but᛫'
static from_form_to_transcription(runic_alphabet)[source]

Make a dictionary whose keys are the forms of runes and whose values are their transcriptions. Used by the transcribe method.

Parameters
  • runic_alphabet (list) – the runic alphabet in use

Return type

dict

static transcribe(rune_sentence, runic_alphabet)[source]

From a runic inscription, the transcribe method gives a conventional transcription.

Parameters
  • rune_sentence (str) – elements of this are from runic_alphabet or are punctuation

  • runic_alphabet (list) – the runic alphabet in use

Return type

str
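The transcription step amounts to a per-character dictionary lookup that passes punctuation through unchanged. A hypothetical sketch, with a three-rune table for illustration:

```python
# Hypothetical sketch of rune transcription: look up each rune's
# conventional Latin transcription, leaving punctuation untouched.
def transcribe_sketch(rune_sentence, form_to_transcription):
    return "".join(form_to_transcription.get(ch, ch) for ch in rune_sentence)

FORMS = {"\u16a0": "f", "\u16a2": "u", "\u16a6": "th"}  # ᚠ ᚢ ᚦ
```

For example, the sequence ᚠᚢᚦ transcribes to "futh", while runic punctuation like ᛫ is left as-is.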

8.1.1.16. cltk.alphabet.omr module

The alphabet for Marathi.

Using the International Alphabet of Sanskrit Transliteration (IAST), these vowels are represented as follows:

>>> from cltk.alphabet import omr
>>> omr.VOWELS[:5]
['अ', 'आ', 'इ', 'ई', 'उ']
>>> omr.IAST_VOWELS[:5]
['a', 'ā', 'i', 'ī', 'u']
>>> list(zip(omr.SEMI_VOWELS, omr.IAST_SEMI_VOWELS))
[('य', 'y'), ('र', 'r'), ('ल', 'l'), ('व', 'w')]

8.1.1.17. cltk.alphabet.ory module

The Odia alphabet.

>>> from cltk.alphabet import ory
>>> ory.VOWELS["0B05"]
'ଅ'
>>> ory.STRUCTURED_CONSONANTS["0B15"]
'କ'
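As the doctest suggests, the ory tables are keyed by hexadecimal Unicode code point strings, so each value is just the character at that code point (standard Unicode; e.g. U+0B05 is ଅ, ORIYA LETTER A):

```python
# The ory tables are keyed by hex code point strings; each value is
# simply the character at that code point.
code_point = "0B05"
char = chr(int(code_point, 16))  # 'ଅ', ORIYA LETTER A
```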

8.1.1.18. cltk.alphabet.ota module

The Ottoman alphabet.

Misc. notes:

  • Based off Persian Alphabet Transliteration in CLTK by Iman Nazar

  • Uses UTF-8 Encoding for Ottoman/Persian Letters

  • When Arabic letters are printed, they appear in the console left to right and inconsistently joined, but they join and flow correctly right to left when pasted into a word processor. The problem exists only in the terminal.

TODO: Add tests

8.1.1.19. cltk.alphabet.oty module

Alphabet for Old Tamil. GRANTHA_CONSONANTS are from the Grantha script, which was used between the 6th and 20th centuries to write Sanskrit and the classical language Manipravalam.

TODO: Add tests

8.1.1.20. cltk.alphabet.pes module

The Persian alphabet.

TODO: Write tests.

cltk.alphabet.pes.mk_replacement_regex()[source]
cltk.alphabet.pes.normalize_text(text)[source]
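Persian normalizers typically map Arabic-specific letter forms to their Persian equivalents (e.g. Arabic yeh and kaf). The sketch below illustrates what the replacement-regex step might look like; the names and the two-rule table are illustrative, not the module's actual contents:

```python
import re

# Illustrative replacement table: Arabic letter forms -> Persian forms.
REPLACEMENTS = {
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> FARSI YEH
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> KEHEH
}

def mk_replacement_regex_sketch():
    """Compile one alternation regex covering every source character."""
    return re.compile("|".join(map(re.escape, REPLACEMENTS)))

def normalize_text_sketch(text):
    """Apply all replacements in a single pass over the text."""
    return mk_replacement_regex_sketch().sub(
        lambda m: REPLACEMENTS[m.group(0)], text
    )
```

Compiling a single alternation keeps normalization to one pass over the input rather than one pass per rule.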

8.1.1.21. cltk.alphabet.pli module

The Pali alphabet.

TODO: Add tests.

8.1.1.22. cltk.alphabet.processes module

This module holds the Process for normalizing text strings, usually before the text is sent to other processes.

class cltk.alphabet.processes.NormalizeProcess(language: str = None)[source]

Bases: cltk.core.data_types.Process

Generic process for text normalization.

language: str = None
algorithm
run(input_doc)[source]

Run the normalization algorithm, which takes and returns a string, over the raw text of the input document.

Return type

Doc

class cltk.alphabet.processes.GreekNormalizeProcess(language: str = None)[source]

Bases: cltk.alphabet.processes.NormalizeProcess

Text normalization for Ancient Greek.

>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> lang = "grc"
>>> orig_text = get_example_text(lang)
>>> non_normed_doc = Doc(raw=orig_text)
>>> normalize_proc = GreekNormalizeProcess(language=lang)
>>> normalized_text = normalize_proc.run(input_doc=non_normed_doc)
>>> normalized_text == orig_text
False
language: str = 'grc'
class cltk.alphabet.processes.LatinNormalizeProcess(language: str = None)[source]

Bases: cltk.alphabet.processes.NormalizeProcess

Text normalization for Latin.

>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> lang = "lat"
>>> orig_text = get_example_text(lang)
>>> non_normed_doc = Doc(raw=orig_text)
>>> normalize_proc = LatinNormalizeProcess(language=lang)
>>> normalized_text = normalize_proc.run(input_doc=non_normed_doc)
>>> normalized_text == orig_text
True
language: str = 'lat'

8.1.1.23. cltk.alphabet.san module

Data module for the Sanskrit language's alphabet and related characters.

8.1.1.24. cltk.alphabet.tel module

The Telugu alphabet.

TODO: Add tests.

8.1.1.25. cltk.alphabet.text_normalization module

Functions for preprocessing texts. Not language-specific.

cltk.alphabet.text_normalization.cltk_normalize(text, compatibility=True)[source]
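The compatibility flag suggests a choice between Unicode NFKC and NFC normalization. A sketch under that assumption (the actual implementation may differ):

```python
import unicodedata

# Sketch assuming cltk_normalize switches between NFKC (compatibility
# composition) and plain canonical composition (NFC).
def cltk_normalize_sketch(text, compatibility=True):
    form = "NFKC" if compatibility else "NFC"
    return unicodedata.normalize(form, text)
```

Under NFC, a letter plus combining accent composes into one code point; NFKC additionally folds compatibility characters such as the ﬁ ligature into "fi".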
cltk.alphabet.text_normalization.remove_non_ascii(input_string)[source]

Remove non-ASCII characters. Source: http://stackoverflow.com/a/1342373
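Following the approach in the cited Stack Overflow answer, the filter can be as simple as this sketch:

```python
# Keep only code points below 128 (plain ASCII).
def remove_non_ascii_sketch(input_string):
    return "".join(ch for ch in input_string if ord(ch) < 128)
```

Note that accented characters are dropped outright, not transliterated: "amō" becomes "am".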

cltk.alphabet.text_normalization.remove_non_latin(input_string, also_keep=None)[source]

Remove non-Latin characters. also_keep should be a list of additional characters (e.g. punctuation) that will not be filtered out.
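One way to implement such a filter is to test each character's Unicode name, honoring also_keep as described. This is an illustrative sketch; the library's implementation may differ:

```python
import unicodedata

# Illustrative filter: keep characters whose Unicode name contains
# "LATIN", plus whitespace and anything listed in also_keep.
def remove_non_latin_sketch(input_string, also_keep=None):
    keep = set(also_keep or [])
    out = []
    for ch in input_string:
        if ch in keep or ch.isspace():
            out.append(ch)
            continue
        if "LATIN" in unicodedata.name(ch, ""):
            out.append(ch)
    return "".join(out)
```

For example, Greek letters are stripped from mixed input, while punctuation survives only when passed via also_keep.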

8.1.1.26. cltk.alphabet.urd module

The Urdu alphabet.

TODO: Add tests.