8.1.1. cltk.alphabet package

Modules for accessing the alphabets and character sets of in-scope CLTK languages.

8.1.1.2. Submodules

8.1.1.3. cltk.alphabet.ang module

The Old English alphabet.

>>> from cltk.alphabet import ang
>>> ang.DIGITS[:5]
['ān', 'tƿeġen', 'þrēo', 'fēoƿer', 'fīf']
>>> ang.DIPHTHONGS[:5]
['ea', 'eo', 'ie']

8.1.1.4. cltk.alphabet.arb module

The Arabic alphabet.

>>> from cltk.alphabet import arb
>>> arb.LETTERS[:5]
('ا', 'ب', 'ت', 'ة', 'ث')
>>> arb.PUNCTUATION_MARKS
['،', '؛', '؟']
>>> arb.ALEF
'ا'
>>> arb.WEAK
('ا', 'و', 'ي', 'ى')

8.1.1.5. cltk.alphabet.arc module

The Imperial Aramaic alphabet, plus a simple script to transform a Hebrew transcription of an Imperial Aramaic text into its own Unicode block.

TODO: Add Hebrew-to-Aramaic converter

cltk.alphabet.arc.square_to_imperial(square_script)[source]

Transform a Hebrew transcription of an Imperial Aramaic text into its own Unicode block.

Return type

str
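The conversion described above amounts to a character-for-character translation table. The sketch below is a hypothetical illustration covering only two letters, not the module's actual table; it maps Hebrew square-script letters to the Imperial Aramaic Unicode block (U+10840–U+1085F).

```python
# Hypothetical sketch of a square-script-to-Imperial-Aramaic converter.
# Only two letters are mapped here for illustration.
SQUARE_TO_IMPERIAL = {
    "\u05D0": "\U00010840",  # Hebrew alef  -> Imperial Aramaic aleph
    "\u05D1": "\U00010841",  # Hebrew bet   -> Imperial Aramaic beth
}

def square_to_imperial_sketch(square_script: str) -> str:
    """Replace each square-script letter with its Imperial Aramaic form."""
    return "".join(SQUARE_TO_IMPERIAL.get(ch, ch) for ch in square_script)
```

Characters outside the table (spaces, punctuation) pass through unchanged.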

8.1.1.6. cltk.alphabet.ben module

The Bengali alphabet.

>>> from cltk.alphabet import ben
>>> ben.VOWELS[:5]
['অ', 'আ', 'ই', 'ঈ', 'উ']
>>> ben.DEPENDENT_VOWELS[:5]
['◌া', 'ি', '◌ী', '◌ু', '◌ূ']
>>> ben.CONSONANTS[:5]
['ক', 'খ', 'গ', 'ঘ ', 'ঙ']

8.1.1.7. cltk.alphabet.egy module

Convert MdC transliterated text to Unicode.

cltk.alphabet.egy.mdc_unicode(string, q_kopf=True)[source]

Convert transliterated text to Unicode. The transliterated text is passed to the function as the string argument, and the relevant characters are searched and replaced. If the q_kopf parameter is False, 'q' is replaced with 'ḳ'.

Parameters
  • string (str) – the transliterated text to convert

  • q_kopf (bool) – if False, replace 'q' with 'ḳ'

Return type

str
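To make the behavior concrete, here is a toy version with only two replacement rules; the actual mdc_unicode covers the full MdC transliteration alphabet.

```python
# Toy sketch of MdC-to-Unicode conversion with a tiny rule set;
# the real function handles the complete MdC character inventory.
def mdc_unicode_sketch(string, q_kopf=True):
    replacements = {"A": "\ua723"}  # MdC aleph -> ꜣ
    if not q_kopf:
        replacements["q"] = "\u1e33"  # q -> ḳ
    for src, dst in replacements.items():
        string = string.replace(src, dst)
    return string
```

For example, with q_kopf=False the input "qA" becomes "ḳꜣ".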

8.1.1.8. cltk.alphabet.enm module

The Middle English alphabet.

The consonant sounds of Middle English are categorized as follows:

  • Stops: ⟨/b/, /p/, /d/, /t/, /g/, /k/⟩

  • Fricatives and affricates: ⟨/ǰ/, /č/, /v/, /f/, /ð/, /θ/, /z/, /s/, /ž/, /š/, /c̹/, /x/, /h/⟩

  • Nasals: ⟨/m/, /n/, /ɳ/⟩

  • Lateral Resonants: ⟨/l/⟩

  • Medial Resonants: ⟨/r/, /y/, /w/⟩

Thorn (þ) was gradually replaced by the digraph “th”, while eth (ð), which had already fallen out of use by the 14th century, was later replaced by “d”.

Wynn (ƿ) is the predecessor of “w”. Modern transliterations usually replace it with “w” to avoid confusion with the strikingly similar letter “p”.

The vowel sounds in Middle English are divided into:

  • Long Vowels: ⟨/a:/, /e/, /e̜/, /i/, /ɔ:/, /o/, /u/⟩

  • Short Vowels: ⟨/a/, /ɛ/, /I/, /ɔ/, /U/, /ə/⟩

As established rules for Middle English orthography were effectively nonexistent, compiling a definitive list of diphthongs is non-trivial. The following aims to list the most commonly used diphthongs.

>>> from cltk.alphabet import enm
>>> enm.ALPHABET[:5]
['a', 'b', 'c', 'd', 'e']
>>> enm.CONSONANTS[:5]
['b', 'c', 'd', 'f', 'g']
cltk.alphabet.enm.normalize_middle_english(text, to_lower=True, alpha_conv=True, punct=True)[source]

Normalize a Middle English text string and return the normalized string.

Parameters
  • text (str) – text to be normalized

  • to_lower (bool) – convert text to lowercase

  • alpha_conv (bool) – convert the alphabet to canonical form: æ -> ae, þ -> th, ð -> th, ȝ -> y at the beginning of a word, gh otherwise

  • punct (bool) – remove punctuation

>>> normalize_middle_english('Whan Phebus in the CraBbe had neRe hys cours ronne', to_lower = True)
'whan phebus in the crabbe had nere hys cours ronne'
>>> normalize_middle_english('I pray ȝow þat ȝe woll', alpha_conv = True)
'i pray yow that ye woll'
>>> normalize_middle_english("furst, to begynne:...", punct = True)
'furst to begynne'
Return type

str
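The alpha_conv mapping described above can be sketched in a few lines. This is a hypothetical reimplementation for illustration, not the module's own code:

```python
import re

# Hypothetical sketch of the alpha_conv step: map Middle English
# letters to their canonical modern equivalents.
def to_canonical(word):
    word = word.replace("\u00e6", "ae")                          # æ -> ae
    word = word.replace("\u00fe", "th").replace("\u00f0", "th")  # þ, ð -> th
    word = re.sub(r"^\u021d", "y", word)                         # ȝ at start -> y
    return word.replace("\u021d", "gh")                          # ȝ elsewhere -> gh
```

For example, "ȝow" becomes "yow" and "riȝt" becomes "right", matching the doctest above.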

8.1.1.9. cltk.alphabet.fro module

The normalizer aims to maximally reduce the variation between the orthography of texts written in the Anglo-Norman dialect and bring it in line with the “orthographe commune”. It is heavily inspired by Pope (1956). Spelling variation is not consistent enough to ensure the highest accuracy; the normalizer in its current form should therefore be used as a last resort.

The normalizer, word tokenizer, stemmer, lemmatizer, and list of stopwords for Old and Middle French were developed as part of Google Summer of Code 2017. A full write-up of this work can be found at: https://gist.github.com/nat1881/6f134617805e2efbe5d275770e26d350

References: Pope, M. K. 1956. From Latin to Modern French with Especial Consideration of Anglo-Norman. Manchester: MUP.

Anglo-French spelling variants normalized to the “orthographe commune”, from Pope (1956):

  • word-final d - e.g. vertud vs vertu

  • use of <u> over <ou>

  • <eaus> for <eus>, <ceaus> for <ceus>

  • triphthongs:
    • <iu> for <ieu>

    • <u> for <eu>

    • <ie> for <iee>

    • <ue> for <uee>

    • <ure> for <eure>

  • “epenthetic vowels” - e.g. averai for avrai

  • <eo> for <o>

  • <iw>, <ew> for <ieux>

  • final <a> for <e>

cltk.alphabet.fro.build_match_and_apply_functions(pattern, replace)[source]

Assemble regex patterns.
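A minimal sketch of what such a regex-pair builder might look like; the implementation is illustrative, and the rule shown (dropping word-final d) is taken from the variant list above.

```python
import re

# Illustrative sketch: pair a "does this rule apply?" test with the
# function that applies the substitution.
def build_match_and_apply_sketch(pattern, replace):
    def matches_rule(word):
        return re.search(pattern, word) is not None
    def apply_rule(word):
        return re.sub(pattern, replace, word)
    return matches_rule, apply_rule

# Rule for the word-final d variant, e.g. vertud vs vertu.
matches_final_d, drop_final_d = build_match_and_apply_sketch(r"d$", "")
```

Applying the rule to "vertud" yields the normalized "vertu".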

cltk.alphabet.fro.normalize_fr(tokens)[source]

Normalize Old and Middle French tokens.

TODO: Make this work again with a tokenizer.

Return type

List[str]

8.1.1.10. cltk.alphabet.gmh module

The alphabet for Middle High German.

The consonants of Middle High German are categorized as:

  • Stops: ⟨p t k/c/q b d g⟩

  • Affricates: ⟨pf/ph tz/z⟩

  • Fricatives: ⟨v f s ȥ sch ch h⟩

  • Nasals: ⟨m n⟩

  • Liquids: ⟨l r⟩

  • Semivowels: ⟨w j⟩

Misc. notes:

  • c is used only at the beginning of loanwords and is pronounced the same as k (e.g. calant, cappitain)

  • Double consonants are pronounced the same way as their corresponding letters in Modern Standard German (e.g. pp/p)

  • Modern German schl, schm, schn, schw are written in MHG as sl, sm, sn, sw

  • æ (also seen as ae), œ (also seen as oe) and iu denote the use of Umlaut over â, ô and û respectively

  • ȥ or ʒ is used in modern handbooks and grammars to indicate the s or s-like sound which arose from Germanic t in the High German consonant shift.

>>> from cltk.alphabet import gmh
>>> gmh.CONSONANTS[:5]
['b', 'd', 'g', 'h', 'f']
>>> gmh.VOWELS[:5]
['a', 'ë', 'e', 'i', 'o']
cltk.alphabet.gmh.normalize_middle_high_german(text, to_lower_all=True, to_lower_beginning=False, alpha_conv=True, punct=True, ascii=False)[source]

Normalize input string.

>>> from cltk.alphabet import gmh
>>> from cltk.languages.example_texts import get_example_text
>>> gmh.normalize_middle_high_german(get_example_text("gmh"))[:50]
'uns ist in alten\nmæren wunders vil geseit\nvon hele'
Parameters
  • text (str) – text to normalize

  • to_lower_beginning (bool) –

  • to_lower_all (bool) – convert whole text to lowercase

  • alpha_conv (bool) – convert alphabet to canonical form

  • punct (bool) – remove punctuation

  • ascii (bool) – return the ASCII form

Returns

normalized text

8.1.1.11. cltk.alphabet.guj module

The Gujarati alphabet.

>>> from cltk.alphabet import guj
>>> guj.VOWELS[:5]
['અ', 'આ', 'ઇ', 'ઈ', 'ઉ']
>>> guj.CONSONANTS[:5]
['ક', 'ખ', 'ગ', 'ઘ', 'ચ']

8.1.1.12. cltk.alphabet.hin module

The Hindi alphabet.

>>> from cltk.alphabet import hin
>>> hin.VOWELS[:5]
['अ', 'आ', 'इ', 'ई', 'उ']
>>> hin.CONSONANTS[:5]
['क', 'ख', 'ग', 'घ', 'ङ']
>>> hin.SONORANT_CONSONANTS
['य', 'र', 'ल', 'व']

8.1.1.13. cltk.alphabet.kan module

The Kannada alphabet. The characters can be divided into 3 categories:

  1. Swaras (Vowels): 13 in modern Kannada and 14 in Classical

  2. Vyanjanas (Consonants), further divided into 2 categories:

    1. Structured Consonants: 25

    2. Unstructured Consonants: 9 in modern Kannada and 11 in Classical

  3. Yogavaahakas (part vowel, part consonant): 2

Corresponding to each Swara and Yogavaahaka there is a symbol; thus Consonant + Vowel Symbol = Kagunita.
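The Kagunita composition is simply the consonant code point followed by the dependent vowel sign. The example below shows standard Unicode behavior, not anything CLTK-specific:

```python
# Kagunita: consonant + dependent vowel sign forms one syllable glyph.
ka = "\u0c95"             # ಕ  KANNADA LETTER KA
vowel_sign_aa = "\u0cbe"  # ಾ  KANNADA VOWEL SIGN AA
kagunita = ka + vowel_sign_aa  # renders as the single syllable "kaa"
```

The result is a two-code-point string that a text renderer displays as one syllable.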

>>> from cltk.alphabet import kan
>>> kan.VOWELS[:5]
['ಅ', 'ಆ', 'ಇ', 'ಈ', 'ಉ']
>>> kan.STRUCTURED_CONSONANTS[:5]
['ಕ', 'ಖ', 'ಗ', 'ಘ', 'ಙ']

8.1.1.14. cltk.alphabet.lat module

Alphabet and text normalization for Latin.

cltk.alphabet.lat.normalize_lat(text)[source]

The function for all default Latin normalization.

TODO: Add parameters for stripping macrons, other unlikely chars. Perhaps use remove_non_ascii().

Return type

str
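The macron stripping mentioned in the TODO above could be sketched with Unicode decomposition. This is illustrative only, not what normalize_lat currently does:

```python
import unicodedata

# Illustrative macron stripping: decompose, drop the combining
# macron (U+0304), then recompose.
def strip_macrons(text):
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if ch != "\u0304")
    return unicodedata.normalize("NFC", stripped)
```

For example, "amō" becomes "amo" and "vīta" becomes "vita".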

8.1.1.15. cltk.alphabet.non module

Old Norse runes, Unicode block: 16A0–16FF. Source: Viking Language 1, Jessie L. Byock

TODO: Document and test better.

class cltk.alphabet.non.AutoName(value)[source]

Bases: enum.Enum

An enumeration.

class cltk.alphabet.non.RunicAlphabetName(value)[source]

Bases: cltk.alphabet.non.AutoName

An enumeration.

elder_futhark = 'elder_futhark'
younger_futhark = 'younger_futhark'
short_twig_younger_futhark = 'short_twig_younger_futhark'
class cltk.alphabet.non.Rune(runic_alphabet, form, sound, transcription, name)[source]

Bases: object

>>> Rune(RunicAlphabetName.elder_futhark, "ᚺ", "h", "h", "haglaz")

>>> Rune.display_runes(ELDER_FUTHARK)
['ᚠ', 'ᚢ', 'ᚦ', 'ᚨ', 'ᚱ', 'ᚲ', 'ᚷ', 'ᚹ', 'ᚺ', 'ᚾ', 'ᛁ', 'ᛃ', 'ᛇ', 'ᛈ', 'ᛉ', 'ᛊ', 'ᛏ', 'ᛒ', 'ᛖ', 'ᛗ', 'ᛚ', 'ᛜ', 'ᛟ', 'ᛞ']
static display_runes(runic_alphabet)[source]

Display the given runic alphabet.

Parameters
  • runic_alphabet (list) – the runic alphabet to display

Return type

list

static from_form_to_transcription(form, runic_alphabet)[source]
Parameters
  • form (str) –

  • runic_alphabet (list) –

Returns

conventional transcription of the rune

class cltk.alphabet.non.Transcriber[source]

Bases: object

>>> little_jelling_stone = "᛬ᚴᚢᚱᛘᛦ᛬ᚴᚢᚾᚢᚴᛦ᛬ᚴ(ᛅᚱ)ᚦᛁ᛬ᚴᚢᛒᛚ᛬ᚦᚢᛋᛁ᛬ᛅ(ᚠᛏ)᛬ᚦᚢᚱᚢᛁ᛬ᚴᚢᚾᚢ᛬ᛋᛁᚾᛅ᛬ᛏᛅᚾᛘᛅᚱᚴᛅᛦ᛬ᛒᚢᛏ᛬"
>>> Transcriber.transcribe(little_jelling_stone, YOUNGER_FUTHARK)
'᛫kurmR᛫kunukR᛫k(ar)þi᛫kubl᛫þusi᛫a(ft)᛫þurui᛫kunu᛫sina᛫tanmarkaR᛫but᛫'
static from_form_to_transcription(runic_alphabet)[source]

Make a dictionary whose keys are the forms of runes and whose values are their transcriptions. Used by the transcribe method.

Parameters
  • runic_alphabet (list) – the runic alphabet in use

Return type

dict

static transcribe(rune_sentence, runic_alphabet)[source]

From a runic inscription, the transcribe method gives a conventional transcription.

Parameters
  • rune_sentence (str) – elements of this are from runic_alphabet or are punctuation

  • runic_alphabet (list) – the runic alphabet in use

Return type

str
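The transcription step amounts to a per-character dictionary lookup that passes punctuation through unchanged. A hypothetical sketch, with a three-rune table for illustration:

```python
# Hypothetical sketch of rune transcription: look up each rune's
# conventional Latin transcription, leaving punctuation untouched.
def transcribe_sketch(rune_sentence, form_to_transcription):
    return "".join(form_to_transcription.get(ch, ch) for ch in rune_sentence)

FORMS = {"\u16a0": "f", "\u16a2": "u", "\u16a6": "th"}  # ᚠ ᚢ ᚦ
```

For example, the sequence ᚠᚢᚦ transcribes to "futh", while runic punctuation like ᛫ is left as-is.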

8.1.1.16. cltk.alphabet.omr module

The alphabet for Marathi.

Using the International Alphabet of Sanskrit Transliteration (IAST), these vowels are represented as follows:

>>> from cltk.alphabet import omr
>>> omr.VOWELS[:5]
['अ', 'आ', 'इ', 'ई', 'उ']
>>> omr.IAST_VOWELS[:5]
['a', 'ā', 'i', 'ī', 'u']
>>> list(zip(omr.SEMI_VOWELS, omr.IAST_SEMI_VOWELS))
[('य', 'y'), ('र', 'r'), ('ल', 'l'), ('व', 'w')]

8.1.1.17. cltk.alphabet.ory module

The Odia alphabet.

>>> from cltk.alphabet import ory
>>> ory.VOWELS["0B05"]
'ଅ'
>>> ory.STRUCTURED_CONSONANTS["0B15"]
'କ'
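As the doctest suggests, the ory tables are keyed by hexadecimal Unicode code point strings, so each value is just the character at that code point (standard Unicode; e.g. U+0B05 is ଅ, ORIYA LETTER A):

```python
# The ory tables are keyed by hex code point strings; each value is
# simply the character at that code point.
code_point = "0B05"
char = chr(int(code_point, 16))  # 'ଅ', ORIYA LETTER A
```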

8.1.1.18. cltk.alphabet.ota module

The Ottoman alphabet.

Misc. notes:

  • Based off Persian Alphabet Transliteration in CLTK by Iman Nazar

  • Uses UTF-8 Encoding for Ottoman/Persian Letters

  • When Arabic letters are printed, they appear in the console left to right and inconsistently joined, but they join and flow correctly right to left when pasted into a word processor. The problem exists only in the terminal.

TODO: Add tests

8.1.1.19. cltk.alphabet.oty module

Alphabet for Old Tamil. GRANTHA_CONSONANTS are from the Grantha script, which was used between the 6th and 20th centuries to write Sanskrit and the classical language Manipravalam.

TODO: Add tests

8.1.1.20. cltk.alphabet.pes module

The Persian alphabet.

TODO: Write tests.

cltk.alphabet.pes.mk_replacement_regex()[source]
cltk.alphabet.pes.normalize_text(text)[source]
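Persian normalizers typically map Arabic-specific letter forms to their Persian equivalents (e.g. Arabic yeh and kaf). The sketch below illustrates what the replacement-regex step might look like; the names and the two-rule table are illustrative, not the module's actual contents:

```python
import re

# Illustrative replacement table: Arabic letter forms -> Persian forms.
REPLACEMENTS = {
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> FARSI YEH
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> KEHEH
}

def mk_replacement_regex_sketch():
    """Compile one alternation regex covering every source character."""
    return re.compile("|".join(map(re.escape, REPLACEMENTS)))

def normalize_text_sketch(text):
    """Apply all replacements in a single pass over the text."""
    return mk_replacement_regex_sketch().sub(
        lambda m: REPLACEMENTS[m.group(0)], text
    )
```

Compiling a single alternation keeps normalization to one pass over the input rather than one pass per rule.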

8.1.1.21. cltk.alphabet.pli module

The Pali alphabet.

TODO: Add tests.

8.1.1.22. cltk.alphabet.processes module

This module holds the Process for normalizing text strings, usually before the text is sent to other processes.

class cltk.alphabet.processes.NormalizeProcess(language: str = None)[source]

Bases: cltk.core.data_types.Process

Generic process for text normalization.

language: str = None
algorithm
run(input_doc)[source]

Run the normalization algorithm, which takes and returns a string, over the raw text of the input document.

Return type

Doc

class cltk.alphabet.processes.GreekNormalizeProcess(language: str = None)[source]

Bases: cltk.alphabet.processes.NormalizeProcess

Text normalization for Ancient Greek.

>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> lang = "grc"
>>> orig_text = get_example_text(lang)
>>> non_normed_doc = Doc(raw=orig_text)
>>> normalize_proc = GreekNormalizeProcess(language=lang)
>>> normalized_text = normalize_proc.run(input_doc=non_normed_doc)
>>> normalized_text == orig_text
False
language: str = 'grc'
class cltk.alphabet.processes.LatinNormalizeProcess(language: str = None)[source]

Bases: cltk.alphabet.processes.NormalizeProcess

Text normalization for Latin.

>>> from cltk.core.data_types import Doc, Word
>>> from cltk.languages.example_texts import get_example_text
>>> from boltons.strutils import split_punct_ws
>>> lang = "lat"
>>> orig_text = get_example_text(lang)
>>> non_normed_doc = Doc(raw=orig_text)
>>> normalize_proc = LatinNormalizeProcess(language=lang)
>>> normalized_text = normalize_proc.run(input_doc=non_normed_doc)
>>> normalized_text == orig_text
True
language: str = 'lat'

8.1.1.23. cltk.alphabet.san module

Data module for the Sanskrit language's alphabet and related characters.

8.1.1.24. cltk.alphabet.tel module

The Telugu alphabet.

TODO: Add tests.

8.1.1.25. cltk.alphabet.text_normalization module

Functions for preprocessing texts. Not language-specific.

cltk.alphabet.text_normalization.cltk_normalize(text, compatibility=True)[source]
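The compatibility flag suggests a choice between Unicode NFKC and NFC normalization. A sketch under that assumption (the actual implementation may differ):

```python
import unicodedata

# Sketch assuming cltk_normalize switches between NFKC (compatibility
# composition) and plain canonical composition (NFC).
def cltk_normalize_sketch(text, compatibility=True):
    form = "NFKC" if compatibility else "NFC"
    return unicodedata.normalize(form, text)
```

Under NFC, a letter plus combining accent composes into one code point; NFKC additionally folds compatibility characters such as the ﬁ ligature into "fi".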
cltk.alphabet.text_normalization.remove_non_ascii(input_string)[source]

Remove non-ASCII characters. Source: http://stackoverflow.com/a/1342373
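Following the approach in the cited Stack Overflow answer, the filter can be as simple as this sketch:

```python
# Keep only code points below 128 (plain ASCII).
def remove_non_ascii_sketch(input_string):
    return "".join(ch for ch in input_string if ord(ch) < 128)
```

Note that accented characters are dropped outright, not transliterated: "amō" becomes "am".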

cltk.alphabet.text_normalization.remove_non_latin(input_string, also_keep=None)[source]

Remove non-Latin characters. also_keep should be a list of additional characters (e.g. punctuation) that will not be filtered out.
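One way to implement such a filter is to test each character's Unicode name, honoring also_keep as described. This is an illustrative sketch; the library's implementation may differ:

```python
import unicodedata

# Illustrative filter: keep characters whose Unicode name contains
# "LATIN", plus whitespace and anything listed in also_keep.
def remove_non_latin_sketch(input_string, also_keep=None):
    keep = set(also_keep or [])
    out = []
    for ch in input_string:
        if ch in keep or ch.isspace():
            out.append(ch)
            continue
        if "LATIN" in unicodedata.name(ch, ""):
            out.append(ch)
    return "".join(out)
```

For example, Greek letters are stripped from mixed input, while punctuation survives only when passed via also_keep.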

8.1.1.26. cltk.alphabet.urd module

The Urdu alphabet.

TODO: Add tests.