Hindi¶
Hindi is a standardised and Sanskritised register of the Hindustani language. Like other Indo-Aryan languages, Hindi is considered to be a direct descendant of an early form of Sanskrit, through Sauraseni Prakrit and Śauraseni Apabhraṃśa. It has been influenced by Dravidian languages, Turkic languages, Persian, Arabic, Portuguese and English. Hindi emerged as Apabhramsha, a degenerated form of Prakrit, in the 7th century A.D. By the 10th century A.D., it became stable. (Source: Wikipedia)
Corpora¶
Use CorpusImporter()
or browse the CLTK GitHub organization (anything beginning with hindi_
) to discover available Hindi corpora.
In [1]: from cltk.corpus.utils.importer import CorpusImporter
In [2]: c = CorpusImporter('hindi')
In [3]: c.list_corpora
Out[3]:
['hindi_text_ltrc']
Stopword Filtering¶
To use the CLTK’s built-in stopwords list:
In [1]: from cltk.stop.classical_hindi.stops import STOPS_LIST
In [2]: print(STOPS_LIST[:5])
Out[2]: ["हें", "है", "हैं", "हि", "ही"]
Swadesh¶
The corpus module has a class for generating a Swadesh list for classical hindi.
In [1]: from cltk.corpus.swadesh import Swadesh
In [2]: swadesh = Swadesh('hi')
In [3]: swadesh.words()[:10]
Out[3]: ['मैं', 'तू', 'वह', 'हम', 'तुम', 'वे', 'यह', 'वह', 'यहाँ', 'वहाँ' ]
Tokenizer¶
This tool can break a sentence into its constituent words. It simply splits the text into tokens of words and punctuations.
In [1]: from cltk.tokenize.sentence import TokenizeSentence
In [2]: import os
In [3]: root = os.path.expanduser('~')
In [4]: hindi_corpus = os.path.join(root,'cltk_data/hindi/text/hindi_text_ltrc')
In [5]: hindi_text_path = os.path.join(hindi_corpus, 'miscellaneous/gandhi/main.txt')
In [6]: hindi_text = open(hindi_text_path,'r').read()
In [7]: tokenizer = TokenizeSentence('hindi')
In [8]: hindi_text_tokenize = tokenizer.tokenize(hindi_text)
In [9]: print(hindi_text_tokenize[0:100])
['10्र', 'प्रति', 'ा', 'वापस', 'नहीं', 'ली', 'जातीएक', 'बार', 'कस्तुरबा', 'गांधी', 'बहुत', 'बीमार', 'हो', 'गईं', '।', 'जलर्', 'चिकित्सा', 'से', 'उन्हें', 'कोई', 'लाभ', 'नहीं', 'हुआ', '।', 'दूसरे', 'उपचार', 'किये', 'गये', '।', 'उनमे', 'भी', 'सफलता', 'नहीं', 'मिली', '।', 'अंत', 'में', 'गांधीजी', 'ने', 'उन्हें', 'नमक', 'और', 'दाल', 'छोडने', 'की', 'सलाह', 'दी', '।', 'परन्तु', 'इसके', 'लिए', 'बा', 'तैयार', 'नहीं', 'हुईं', '।', 'गांधीजी', 'ने', 'बहुत', 'समझाया', '.', 'पोथियों', 'से', 'प्रमाण', 'पढकर', 'सुनाये', '.', 'लेकर', 'सब', 'व्यर्थ', '।', 'बा', 'बोलीं', '.', '"', 'कोई', 'आपसे', 'कहे', 'कि', 'दाल', 'और', 'नमक', 'छोड', 'दो', 'तो', 'आप', 'भी', 'नहीं', 'छोडेंगे', '।', '"', 'गांधीजी', 'ने', 'तुरन्त', 'प्रसÙ', 'होकर', 'कहा', '.', '"', 'तुम']