The Urdu alphabet and digits are placed in cltk/corpus/urdu/alphabet.py.
The digits are stored in a list,
DIGITS, in which each digit sits at the list index equal to its value (0-9). For example, the Urdu digit for 4 can be accessed in this manner:
In [1]: from cltk.corpus.urdu.alphabet import DIGITS
In [2]: DIGITS[4]
Out[2]: '٤'
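Because the list index equals the digit's value, converting a whole number is a per-character lookup. The following is a minimal sketch, not part of CLTK: the helper to_urdu_digits is hypothetical, and DIGITS is a stand-in built from the code points U+0660 to U+0669 so that the example runs without the library. The list shipped in cltk/corpus/urdu/alphabet.py may contain different code points.

```python
# Stand-in for cltk.corpus.urdu.alphabet.DIGITS (assumption): ten digit
# characters where the list index equals the numeric value.
DIGITS = [chr(0x0660 + value) for value in range(10)]


def to_urdu_digits(number: int) -> str:
    """Render a non-negative integer using the characters in DIGITS."""
    if number < 0:
        raise ValueError("only non-negative integers are supported")
    # str(number) yields ASCII digits; each one is used as a list index.
    return "".join(DIGITS[int(ch)] for ch in str(number))


print(to_urdu_digits(4))
print(to_urdu_digits(1947))
```

With the real list, replacing the stand-in by the import from cltk.corpus.urdu.alphabet should be the only change needed.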
Urdu has three
SHORT_VOWELS, which are essentially diacritics used in the script. It also has four LONG_VOWELS, which are actual letters of the alphabet. The corresponding lists can be imported:
In [1]: from cltk.corpus.urdu.alphabet import SHORT_VOWELS
In [2]: SHORT_VOWELS
Out[2]: ['َ', 'ِ', 'ُ']
In [3]: from cltk.corpus.urdu.alphabet import LONG_VOWELS
In [4]: LONG_VOWELS
Out[4]: ['ا', 'و', 'ی', 'ے']
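Since the short vowels are diacritics, a common use of the list is to strip them from text before comparison or lookup. The sketch below assumes the list contents shown in the output above; the lists are redefined locally as stand-ins so the example runs without CLTK, the code points are an assumption about which characters the output represents, and both helper functions are hypothetical rather than part of the library.

```python
# Stand-ins for the CLTK lists (assumption), written as escapes so the
# intended code points are unambiguous.
SHORT_VOWELS = ["\u064e", "\u0650", "\u064f"]  # zabar, zer, pesh
LONG_VOWELS = ["\u0627", "\u0648", "\u06cc", "\u06d2"]  # alif, vao, choti ye, bari ye


def strip_short_vowels(text: str) -> str:
    """Remove every short-vowel diacritic from text."""
    return "".join(ch for ch in text if ch not in SHORT_VOWELS)


def count_long_vowel_letters(text: str) -> int:
    """Count characters of text that appear in LONG_VOWELS."""
    return sum(1 for ch in text if ch in LONG_VOWELS)


# kaf + zer + te + alif + be
word = "\u06a9\u0650\u062a\u0627\u0628"
bare = strip_short_vowels(word)
print(bare)
print(count_long_vowel_letters(bare))
```

Note that the count is purely orthographic: it reports how many characters belong to the list, not how they are pronounced in a given word.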
The rest of the alphabet consists of
CONSONANTS, which can be imported in the same way.
There are three
SPECIAL characters, which are ligatures or alternative orthographic shapes of letters in the alphabet:
In [1]: from cltk.corpus.urdu.alphabet import SPECIAL
In [2]: SPECIAL
Out[2]: ['ﺁ', 'ۀ', 'ﻻ']