Sanskrit

Sanskrit is the primary liturgical language of Hinduism, a philosophical language of Hinduism, Jainism, Buddhism and Sikhism, and a literary language of ancient and medieval South Asia that also served as a lingua franca. It is a standardised dialect of Old Indo-Aryan, originating as Vedic Sanskrit and tracing its linguistic ancestry back to Proto-Indo-Iranian and Proto-Indo-European. As one of the oldest Indo-European languages for which substantial written documentation exists, Sanskrit holds a prominent position in Indo-European studies. (Source: Wikipedia)

Corpora

Use CorpusImporter() or browse the CLTK GitHub organization (anything beginning with sanskrit_) to discover available Sanskrit corpora.

In [1]: from cltk.corpus.utils.importer import CorpusImporter

In [2]: c = CorpusImporter('sanskrit')

In [3]: c.list_corpora
Out[3]:
['sanskrit_text_jnu', 'sanskrit_text_dcs', 'sanskrit_parallel_sacred_texts', 'sanskrit_text_sacred_texts', 'sanskrit_parallel_gitasupersite', 'sanskrit_text_gitasupersite', 'sanskrit_text_wikipedia', 'sanskrit_text_sanskrit_documents']

Transliterator

This tool is derived from the IndicNLP Project, courtesy of anoopkunchukuttan. It transliterates ITRANS text to Devanagari (Unicode) script and can also romanize Devanagari script.

Script Conversion

Convert from one Indic script to another. This is a simple script that exploits the fact that the Unicode code points of the various Indic scripts lie at corresponding offsets from the base code point of each script.

In [1]: from cltk.corpus.sanskrit.itrans.unicode_transliterate import UnicodeIndicTransliterator

In [2]: input_text=u'राजस्थान'

In [3]: UnicodeIndicTransliterator.transliterate(input_text,"hi","pa")
Out[3]: 'ਰਾਜਸ੍ਥਾਨ'
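The offset idea described above can be sketched in plain Python without CLTK. This is only an illustration under stated assumptions: the function name `shift_script` and the constants are hypothetical, the block base values are those of the Unicode Devanagari and Gurmukhi blocks, and the sketch ignores characters that exist in one script but not in the other, which a real transliterator has to handle.

```python
# Base code points of two Unicode Indic blocks.
DEVANAGARI_BASE = 0x0900
GURMUKHI_BASE = 0x0A00
BLOCK_SIZE = 0x80  # each of these blocks spans 128 code points


def shift_script(text, src_base, dst_base):
    """Move each character of the source block to the same offset in the target block.

    Hypothetical helper for illustration; not part of CLTK.
    """
    out = []
    for ch in text:
        offset = ord(ch) - src_base
        if 0 <= offset < BLOCK_SIZE:
            out.append(chr(dst_base + offset))
        else:
            out.append(ch)  # leave characters outside the source block untouched
    return "".join(out)


converted = shift_script("राजस्थान", DEVANAGARI_BASE, GURMUKHI_BASE)
restored = shift_script(converted, GURMUKHI_BASE, DEVANAGARI_BASE)
print(converted)
```

Because the mapping is a fixed shift, applying it in the opposite direction restores the original text for characters that both blocks define.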

Romanization

Convert Indic script text to Roman text in ITRANS notation.

In [4]: from cltk.corpus.sanskrit.itrans.unicode_transliterate import ItransTransliterator

In [5]: input_text=u'राजस्थान'

In [6]: lang='hi'

In [7]: ItransTransliterator.to_itrans(input_text,lang)
Out[7]: 'rAjasthAna'

Indicization (ITRANS to Indic Script)

Convert ITRANS transliteration to Devanagari (Unicode) script.

In [8]: from cltk.corpus.sanskrit.itrans.unicode_transliterate import ItransTransliterator

In [9]: input_text=u'pitL^In'

In [10]: lang='hi'

In [11]: x=ItransTransliterator.from_itrans(input_text,lang)

In [12]: x
Out[12]: 'पितॣन्'

Query Script Information

Indic scripts were designed with phonetic principles in mind, and the design and organization of the scripts make it easy to obtain phonetic information about the characters.

In [13]: from cltk.corpus.sanskrit.itrans.langinfo import *

In [14]: c = 'क'

In [15]: lang='hi'

In [16]: is_vowel(c,lang)
Out[16]: False

In [17]: is_consonant(c,lang)
Out[17]: True

In [18]: is_velar(c,lang)
Out[18]: True

In [19]: is_palatal(c,lang)
Out[19]: False

In [20]: is_aspirated(c,lang)
Out[20]: False

In [21]: is_unvoiced(c,lang)
Out[21]: True

In [22]: is_nasal(c,lang)
Out[22]: False

Other similar functions are listed here:

In [29]: import cltk.corpus.sanskrit.itrans.langinfo; dir(cltk.corpus.sanskrit.itrans.langinfo)
Out[29]:
['APPROXIMANT_LIST', 'ASPIRATED_LIST', 'AUM_OFFSET', 'COORDINATED_RANGE_END_INCLUSIVE', 'COORDINATED_RANGE_START_INCLUSIVE', 'DANDA', 'DENTAL_RANGE', 'DOUBLE_DANDA', 'FRICATIVE_LIST', 'HALANTA_OFFSET', 'LABIAL_RANGE', 'LC_TA', 'NASAL_LIST', 'NUKTA_OFFSET', 'NUMERIC_OFFSET_END', 'NUMERIC_OFFSET_START', 'PALATAL_RANGE', 'RETROFLEX_RANGE', 'RUPEE_SIGN', 'SCRIPT_RANGES', 'UNASPIRATED_LIST', 'UNVOICED_LIST', 'URDU_RANGES', 'VELAR_RANGE', 'VOICED_LIST', '__author__', '__builtins__', '__cached__', '__doc__', '__file__', '__license__', '__loader__', '__name__', '__package__', '__spec__', 'get_offset', 'in_coordinated_range', 'is_approximant', 'is_aspirated', 'is_aum', 'is_consonant', 'is_dental', 'is_fricative', 'is_halanta', 'is_indiclang_char', 'is_labial', 'is_nasal', 'is_nukta', 'is_number', 'is_palatal', 'is_retroflex', 'is_unaspirated', 'is_unvoiced', 'is_velar', 'is_voiced', 'is_vowel', 'is_vowel_sign', 'offset_to_char']
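To illustrate why the script layout makes such queries easy, here is a minimal sketch that reads place of articulation, voicing and aspiration directly from a Devanagari consonant's code point. It is an assumption-laden illustration, not the langinfo implementation: the names `SERIES` and `describe` are hypothetical, and only the five stop series of the Devanagari block are covered.

```python
DEVANAGARI_BASE = 0x0900

# Inclusive offsets of the five stop series within the Devanagari block.
# Each series is ordered: unvoiced, unvoiced aspirated, voiced, voiced aspirated, nasal.
SERIES = {
    "velar": (0x15, 0x19),      # क ख ग घ ङ
    "palatal": (0x1A, 0x1E),    # च छ ज झ ञ
    "retroflex": (0x1F, 0x23),  # ट ठ ड ढ ण
    "dental": (0x24, 0x28),     # त थ द ध न
    "labial": (0x2A, 0x2E),     # प फ ब भ म
}


def describe(ch):
    """Return phonetic features of a Devanagari stop or nasal, or None otherwise.

    Hypothetical helper for illustration; not part of CLTK.
    """
    offset = ord(ch) - DEVANAGARI_BASE
    for place, (start, end) in SERIES.items():
        if start <= offset <= end:
            position = offset - start  # 0..4 within the series
            return {
                "place": place,
                "aspirated": position in (1, 3),
                "voiced": position in (2, 3, 4),
                "nasal": position == 4,
            }
    return None


print(describe("क"))
```

The position inside each five-character series carries the voicing and aspiration information, so no per-character lookup table is needed for these consonants.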

Swadesh

The corpus module has a class for generating a Swadesh list for Sanskrit.

In [1]: from cltk.corpus.swadesh import Swadesh

In [2]: swadesh = Swadesh('sa')

In [3]: swadesh.words()[:10]
Out[3]: ['अहम्', 'त्वम्', 'स', 'वयम्, नस्', 'यूयम्, वस्', 'ते', 'इदम्', 'तत्', 'अत्र', 'तत्र']

Syllabifier

This tool is also derived from the IndicNLP Project, courtesy of anoopkunchukuttan. It breaks a word into its syllables and can be applied to 17 Indian languages, including those written in Devanagari script (all using Unicode).

In [23]: from cltk.stem.sanskrit.indian_syllabifier import Syllabifier

In [24]: input_text = 'नमस्ते'

In [26]: lang='hindi'

In [27]: x = Syllabifier(lang)

In [28]: x.orthographic_syllabify(input_text)
Out[28]: ['न', 'म', 'स्ते']
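The grouping shown in this output can be approximated with a short, self-contained sketch. This is not the Syllabifier's algorithm; it is a simplified assumption that a new orthographic syllable starts at every independent vowel or consonant unless the previous character is a virama, and that every other sign attaches to the current syllable. The function name `aksharas` is hypothetical, and edge cases such as a word-final virama may be treated differently by the library.

```python
VIRAMA = "\u094d"  # Devanagari sign virama (halanta)


def is_base(ch):
    """True for Devanagari independent vowels and consonants (U+0904..U+0939)."""
    return "\u0904" <= ch <= "\u0939"


def aksharas(word):
    """Split a Devanagari word into orthographic syllables (simplified sketch)."""
    syllables = []
    for ch in word:
        starts_new = is_base(ch) and not (syllables and syllables[-1].endswith(VIRAMA))
        if not syllables or starts_new:
            syllables.append(ch)
        else:
            # Vowel signs, virama, and consonants following a virama join the cluster.
            syllables[-1] += ch
    return syllables


print(aksharas("नमस्ते"))
```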

Tokenizer

This tool is also derived from the IndicNLP Project, courtesy of anoopkunchukuttan. It breaks a sentence into its constituent words by filtering out punctuation and spaces.

In [29]: from cltk.tokenize.sentence import TokenizeSentence

In [30]: tokenizer = TokenizeSentence('sanskrit')

In [31]: input_text = "हिन्दी भारत की सबसे अधिक बोली और समझी जाने वाली भाषा है"

In [32]: tokenizer.tokenize(input_text)
Out[32]: ['हिन्दी', 'भारत', 'की', 'सबसे', 'अधिक', 'बोली', 'और', 'समझी', 'जाने', 'वाली', 'भाषा', 'है']
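As a rough illustration of tokenizing on punctuation and spaces, the sketch below pads each punctuation mark with spaces and then splits on whitespace. It is an assumption, not the CLTK tokenizer: the pattern `PUNCTUATION` and the function `simple_tokenize` are hypothetical, and only the danda, the double danda and a few ASCII marks are treated as punctuation.

```python
import re

# Assumption: danda (U+0964), double danda (U+0965) and common ASCII marks
# are the only punctuation this sketch recognises.
PUNCTUATION = re.compile(r"([\u0964\u0965,.;:!?])")


def simple_tokenize(sentence):
    """Separate punctuation from words, then split on whitespace."""
    spaced = PUNCTUATION.sub(r" \1 ", sentence)
    return spaced.split()


tokens = simple_tokenize("रामः वनं गच्छति। सीता अपि गच्छति॥")
# Drop the punctuation tokens to keep only the words.
words = [t for t in tokens if not PUNCTUATION.fullmatch(t)]
print(tokens)
print(words)
```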

Stopword Filtering

To use CLTK's built-in stopword list:

In [1]: from cltk.stop.sanskrit.stops import STOPS_LIST

In [2]: from cltk.tokenize.indian_tokenizer import indian_punctuation_tokenize_regex

In [3]: s = "हमने पिछले पाठ मे सीखा था कि “अहम् गच्छामि” का मतलब “मै जाता हूँ” है। आप ऊपर की तालिकाँओ "

In [4]: tokens = indian_punctuation_tokenize_regex(s)

In [5]: len(tokens)
Out[5]: 20

In [6]: no_stops = [w for w in tokens if w not in STOPS_LIST]

In [7]: len(no_stops)
Out[7]: 18

In [8]: no_stops
Out[8]:
['हमने',
 'पिछले',
 'पाठ',
 'सीखा',
 'था',
 'कि',
 '“अहम्',
 'गच्छामि”',
 'मतलब',
 '“मै',
 'जाता',
 'हूँ”',
 'है',
 '।',
 'आप',
 'ऊपर',
 'की',
 'तालिकाँओ']
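The filtering step itself is independent of CLTK and can be sketched on its own. The stop words below are hypothetical examples chosen for illustration and are not taken from STOPS_LIST; the function name `remove_stops` is likewise an assumption. Holding the stop words in a set keeps each membership test cheap on long token lists.

```python
# Hypothetical stop words for illustration only; not CLTK's STOPS_LIST.
EXAMPLE_STOPS = frozenset({"च", "एव", "तु"})


def remove_stops(tokens, stops=EXAMPLE_STOPS):
    """Return the tokens that are not in the stop word collection, in order."""
    return [t for t in tokens if t not in stops]


tokens = ["रामः", "च", "लक्ष्मणः", "च", "वनम्", "एव", "गच्छतः"]
filtered = remove_stops(tokens)
print(filtered)
```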