Middle English is collectively the varieties of the English language spoken after the Norman Conquest (1066) until the late 15th century; scholarly opinion varies but the Oxford English Dictionary specifies the period of 1150 to 1500. (Source: Wikipedia)
CLTK’s normalizer attempts to clean the given text, converting it into a canonical form.
The to_lower parameter converts the string to lowercase.
In : from cltk.corpus.middle_english.alphabet import normalize_middle_english
In : normalize_middle_english("Whan Phebus in the Crabbe had nere hys cours ronne And toward the leon his journé gan take", to_lower=True)
Out: 'whan phebus in the crabbe had nere hys cours ronne and toward the leon his journé gan take'
The punct parameter removes punctuation.
In : normalize_middle_english("Thus he hath me dryven agen myn entent, And contrary to my course naturall.", punct=True)
Out: 'thus he hath me dryven agen myn entent and contrary to my course naturall'
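The effect of to_lower and punct can be approximated in plain Python. The following is a minimal sketch, not CLTK's implementation: the helper name simple_normalize is hypothetical, and it assumes that "punctuation" means the ASCII characters in string.punctuation (which also includes apostrophes and hyphens).

```python
import string

# Illustrative helper, not part of CLTK.
# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def simple_normalize(text, to_lower=True, punct=True):
    """Lowercase the text and strip ASCII punctuation (both optional)."""
    if to_lower:
        text = text.lower()
    if punct:
        text = text.translate(_PUNCT_TABLE)
    return text


print(simple_normalize(
    "Thus he hath me dryven agen myn entent, And contrary to my course naturall."
))
# thus he hath me dryven agen myn entent and contrary to my course naturall
```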
The alpha_conv parameter applies the spelling conventions established over the last century.
þ and ð are both converted to th, while 3 (standing in for yogh, ȝ) is converted to y at the start of a word and to gh elsewhere.
In : normalize_middle_english("as 3e lykeþ best", alpha_conv=True)
Out: 'as ye liketh best'
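The letter conversion described above can be sketched with two regular-expression passes. This is an illustration under stated assumptions rather than CLTK's own code: the function name convert_alphabet is hypothetical, the input is assumed to be lowercase, and both the digit 3 and the letter ȝ are treated as yogh (so genuine numerals containing 3 would be altered too).

```python
import re


def convert_alphabet(text):
    """Illustrative only: map thorn, eth, and yogh to modern spellings."""
    # Thorn and eth both become "th".
    text = text.replace("þ", "th").replace("ð", "th")
    # A yogh (written as 3 or ȝ) at the start of a word becomes "y".
    text = re.sub(r"\b[3ȝ]", "y", text)
    # Any yogh left over is word-internal or word-final and becomes "gh".
    return re.sub(r"[3ȝ]", "gh", text)


print(convert_alphabet("3e"))             # ye
print(convert_alphabet("ni3t"))           # night
print(convert_alphabet("þe oðer kniȝt"))  # the other knight
```

The word-initial rule has to run first; otherwise the general yogh-to-gh replacement would consume the initial yogh as well.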
CLTK supports a rule-based affix stemmer for ME.
Keep in mind that, while Middle English is considered a weakly inflected language with a grammatical structure resembling that of Modern English, its lack of orthographic conventions makes it difficult to account for the various affixes.
In : from cltk.stem.middle_english import affix_stemmer
In : from cltk.corpus.middle_english.alphabet import normalize_middle_english
In : text = normalize_middle_english('The speke the henmest kyng, in the hillis he beholdis.').split(" ")
In : affix_stemmer(text)
Out: 'the spek the henm kyng in the hill he behold'
The stemmer also accepts an optional hard-coded exception dictionary. The following example uses the compiled stopwords list as the exceptions.
In : from cltk.stop.middle_english.stops import STOPS_LIST
In : exceptions = dict(zip(STOPS_LIST, STOPS_LIST))
In : affix_stemmer('byfore him'.split(" "), exception_list=exceptions)
Out: 'byfore him'
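To show how a rule-based affix stemmer and an exception dictionary interact, here is a minimal sketch. It is not CLTK's rule set: the suffix list, the minimum stem length of three characters, and the names stem_word and stem_tokens are all assumptions made for illustration.

```python
# Hypothetical suffix list, checked in order; not CLTK's actual rules.
SUFFIXES = ("eth", "est", "is", "es", "en", "e", "s")


def stem_word(word, exceptions=None):
    """Strip the first matching suffix unless the word is an exception."""
    exceptions = exceptions or {}
    if word in exceptions:
        # Exceptions map a word directly to its stem, bypassing the rules.
        return exceptions[word]
    for suffix in SUFFIXES:
        # Keep at least three characters so short words such as "the" survive.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_tokens(tokens, exceptions=None):
    """Stem each token and join the results into one string."""
    return " ".join(stem_word(token, exceptions) for token in tokens)


tokens = "the speke the henmest kyng in the hillis he beholdis".split()
print(stem_tokens(tokens))
# the spek the henm kyng in the hill he behold

print(stem_tokens(["byfore", "him"]))
# byfor him
print(stem_tokens(["byfore", "him"], exceptions={"byfore": "byfore", "him": "him"}))
# byfore him
```

The last two calls show why an exception dictionary is useful: without it, the final e of byfore is treated as an inflectional ending and removed.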
To use CLTK’s built-in stopwords list, we take an example from Chaucer’s “The Summoner’s Tale”:
In : from nltk.tokenize.punkt import PunktLanguageVars
In : from cltk.stop.middle_english.stops import STOPS_LIST
In : sentence = 'This frere bosteth that he knoweth helle'
In : p = PunktLanguageVars()
In : tokens = p.word_tokenize(sentence.lower())
In : [w for w in tokens if w not in STOPS_LIST]
Out: ['frere', 'bosteth', 'knoweth', 'helle']
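The same filtering idea works without any tokenizer dependency. The sketch below assumes whitespace tokenization and a tiny hypothetical stop set; CLTK's STOPS_LIST is considerably longer, and remove_stops is not a CLTK function.

```python
# Hypothetical stop set for illustration; a set gives fast membership tests.
STOPS = {"this", "that", "he"}


def remove_stops(sentence, stops=STOPS):
    """Lowercase, split on whitespace, and drop tokens found in the stop set."""
    return [word for word in sentence.lower().split() if word not in stops]


print(remove_stops("This frere bosteth that he knoweth helle"))
# ['frere', 'bosteth', 'knoweth', 'helle']
```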
The historical events of 11th-century Britain were intertwined with the phonological development of its language. The Norman Conquest of 1066 is mainly responsible for the influx of both Francien and Latin words and, by extension, for the highly variable spelling and phonology of ME.
The Stresser provided by CLTK cannot determine on its own which stress pattern a given word follows. Instead, it accepts the most common stress rules as a parameter: Latin ("LSR"), Germanic ("GSR"), or French ("FSR").
In : from cltk.phonology.middle_english.transcription import Word
In : ".".join(Word('beren').stresser(stress_rule="FSR"))
Out: "ber.'en"
In : ".".join(Word('yisterday').stresser(stress_rule="GSR"))
Out: "yi.ster.'day"
In : ".".join(Word('verbum').stresser(stress_rule="LSR"))
Out: "ver.'bum"
The Word class also provides syllabification for ME words.
In : from cltk.phonology.middle_english.transcription import Word
In : w = Word("hymsylf")
In : w.syllabify()
Out: ['hym', 'sylf']
In : w.syllabified_str()
Out: 'hym.sylf'