Root Word Identification in Malayalam Language

Words are the tools of life and are ubiquitous in every language. Every word in a language is unique, having its own function and meaning. The syntactic and semantic knowledge about individual words can be encapsulated in a highly structured repository known as a computational lexicon, which is essential for Machine Translation. In designing a computational lexicon, the first and foremost task is to identify the head words or root words of the language. The Root Word Identifier proposed in this work is a rule-based approach that automatically removes the inflected part and derives the root word using morphophonemic rules. The system is tested with 2400 words from a Malayalam corpus to generate linguistic information such as the root form, its inflected forms and grammatical category. Performance is evaluated using the statistical measures Precision, Recall and F-measure. The values obtained for these measures are more than 90%.

KEYWORDS

Corpus, Computational lexicon, Morphophonemic rules, Root word, Root Word Identifier

1. INTRODUCTION

A computational lexicon plays an important role in Machine Translation, since it is the place where all the information about the vocabulary of a language is recorded for proper use. A Machine Translation system saves an enormous amount of human effort and time in translating text from one language into another when the source and target languages have computational lexicons of their own. In this work, a software system is developed for the automatic identification of root words together with their grammatical properties. This linguistic information about the words is systematically stored as lexical entries in a computational lexicon.

Identifying the root form of a word is very important in Natural Language Processing. A root word can be taken as a key term for searching, indexing, translating, etc. Statistical tools such as frequency counters, concordance, keyword-in-context and n-gram tools need the root form of a word to reveal more about the vocabulary. For example, a frequency counter can find the most frequently used words in the vocabulary. Such information can also help to predict the right spelling of a word, on the assumption that the most frequent form of a word is the correct one. Morphological analyzers and generators also require root words. Lexicographers assign the root word as the head word, or lexical entry, in the lexicon.

Malayalam is the official language of the state of Kerala, situated in the southern half of the west coast of India. It is one of the 22 official languages of India and one of the four major languages of the Dravidian family [1]. The influence of other languages such as Sanskrit, Tamil, Telugu, Tulu, Toda, Kota, Kodagu and Badaga is seen at the phonemic, morphemic and grammatical levels of the language. Malayalam is a morphologically rich and polysynthetic language. There is no distinction between upper-case and lower-case characters.

International Journal of Computer Science & Information Technology (IJCSIT), Vol 4, No 3, June 2012

Most words in Malayalam occur in an inflected form. To obtain the root form of a word, the suffixes agglutinated with it must be removed. In addition, the morphophonemic change (sandhi) that occurs when a root word concatenates with a suffix must be analysed and generalized. Sandhi rules are phonological alternations triggered at the junctions of words or morphemes. Malayalam grammar categorises sandhi rules into different types. According to the classification based on consonant-vowel pairs, there are svara sandhi (svaraM~ + svaraM~), svara vyanjgana sandhi (svaraM~ + vyanjganaM~), vyanjgana svara sandhi (vyanjganaM~ + svaraM~) and vyanjgana sandhi (vyanjganaM~ + vyanjganaM~). Here svaraM~ is a vowel and vyanjganaM~ is a consonant [1]. The morphophonemic changes occurring between the end phonemes of the word and the initial phoneme of the suffix are used to derive the morphophonemic rules.

Morphological analyzers and part-of-speech taggers developed for Malayalam have attempted to find the root form of a word. In our approach, no lexicon or dictionary is used. The surface structure of the words is studied using a corpus to derive the morphophonemic rules. The Root Word Identifier system uses these generalized rules to automatically identify the root words, their grammatical categories and their inflected forms. A corpus is created for the Malayalam language using documents from the World Wide Web, and 2400 words from this corpus are used to test the system. The results are stored in a computational lexicon that holds the linguistic information about these words, such as whether a word is a noun, pronoun, verb or postposition. It also gives the inflected forms of each root word present in the corpus. The performance of the system is evaluated, and high Precision, Recall and F-measure values are obtained. Incremental development of the computational lexicon can be achieved by running the proposed system with a richer corpus as the resource.

The paper is organized as follows. Related work in this area is presented in Section 2. Section 3 discusses the methodology of the Root Word Identifier, the resource used for this work, the linguistic analysis of Malayalam words and the morphophonemic rules. Section 4 gives the results and discussion. The last section gives the conclusion.

2. RELATED WORKS

Morphological analysis deals with the study of the internal structure of the words of a language based on their grammatical category. It is the process of segmenting a morphologically inflected word into its root word and its associated morphological components, along with the features specifying the morphological structure [2]. Even though a full-fledged morphological analyzer is not available for Malayalam, there have been many attempts in this area, as discussed below.

Morphological analysis of Malayalam verbs using a hybrid approach (the paradigm method combined with suffix stripping) is an attempt to attain morphological generalization of verbs [3]. It uses a dictionary of Malayalam lexical items that records each lexical item, its grammatical category and its paradigm type. The program compares each inflected form. The verbs are categorized into 28 classes, or paradigms, based on the past tense marker, and around 1100 verb inflections were identified. Using the same hybrid approach, a Malayalam morphological analyzer using Apertium Lttoolbox was developed at the Language Technology Centre, Centre for Development of Advanced Computing (C-DAC), Thiruvananthapuram [4] as part of a Machine Translation project. Lttoolbox is available with the Apertium toolkit, an open-source shallow-transfer Machine Translation system that originated within the project "Open-Source Machine Translation for the Languages of Spain". Lttoolbox can be customised to any language by including the required lexical dictionary [5]. It uses the finite-state transducer (FST) approach for lexical processing. Other attempts using stochastic taggers such as HMM taggers are also in progress, but they cannot give high accuracy for Malayalam because the language is inflectionally rich and has a relatively free word order, like Tamil.

Apart from these attempts, related work aimed at developing full-fledged Machine Translation systems is also going on in the language technology field. The development of a Named Entity Recognizer, Noun Phrase Chunkers, a Computational Lexicon, etc. for Malayalam is in progress.

3. ROOT WORD IDENTIFIER METHODOLOGY

Malayalam words can occur in root form, inflected form, derived form, compound form and reduplicated form. Inflected words are formed by affixing grammatical features such as case, number, tense, aspect and mood to the root word. Separating the affixes from an inflected word yields the root of the word and its grammatical information. The root word should be the most basic form of a word that is able to convey a particular description, thought or meaning. A root word is defined as a real word from which new words can be made by adding prefixes and suffixes [6]. Here, all forms affixed immediately after the root word are considered suffixes. No attempt was made to identify prefixes.

Malayalam documents found on the World Wide Web are collected and stored as the Malayalam corpus. A corpus [7] is a large collection of written and/or spoken text samples available in machine-readable form, collected in a scientific way to represent the use of a language [8]. In this work, a corpus of 24,000 words in written form, drawn from different domains, is used for the linguistic analysis of Malayalam. Following the rule-based approach, the rules that govern suffixation are derived manually by analysing the words in the corpus. The Root Word Identifier system separates the root words by removing the suffixes according to these rules.

Common grammatical categories in Malayalam are the noun, pronoun, verb, adverb, adjective, postpositions, indeclinables, clitics, etc. In this work, the main grammatical categories, namely noun, pronoun, verb and postposition, are analysed.

3.1. Noun Morphology

Nouns can occur in isolation or can take gender markers, plural markers, case suffixes, postpositions, clitics, etc. A noun takes the form

W = noun root ± [plural suffix] ± [case suffix] ± [postpositions] ± [clitics] ± …

where W is any word having the properties of a noun. Some of the case markers are shown in Table 1.

Table 1. Case markers in Malayalam with examples

Case            Marker         Example
Nominative      zero           makan~
Accusative      -e             makane
Dative          -kkU/(n)U      makan
Sociative       -ootU          makanootU
Locative        -il~           makanil~
Instrumental    -aal~          makanaal~
Possessive      -ute/nte       makante

Plural forms of nouns contain the suffixes ‘-maaR~’ and ‘-kaL~’, but they also take the allomorphs ‘-ngngal’ and ‘-kkaL~’. Apart from these case markers and plural suffixes, there are other suffixes that agglutinate with nouns [9]. Some of them are the allative marker (-ileekk), the place locative suffix (-athth), the optative suffix (-aakatee), the reciprocity suffix (-tammil) and the sufficient suffix (-maththi).

Pronouns are words that can be used in place of nouns. Since they occur in free form, they are treated as root words for ease of analysis: njaan~ (I; personal pronoun, first person, singular), nii (you; personal pronoun, second person, singular), avan~ (he; third person, remote, singular, masculine), etc. Sixty-two pronouns are identified manually.

3.2. Verb Morphology

The morphological structure of the verb is complex. Verbs are capable of taking tense markers. Verbs in Malayalam are not inflected for person, number or gender. All verbal forms in Malayalam, both finite and non-finite, consist of verb stems followed by affixes that express various grammatical categories such as tense, aspect, mood, voice and valency change [9] [10] [11].

Generally, tense is classified into past, present and future; aspect into perfect, imperfective and imperfect; and mood into declarative, interrogative, imperative, conditional, optative, debitive and potential. The two voices are active and passive. Valency change is classified into causative and passive.

Verbs can be classified [12] according to the suffixes attached to the verb forms. Fifty-one verb suffixes were identified for retrieving the verbal forms of words. Some of them are listed below with examples.

1. Past tense markers: /-njnju/, /-nnu/, /-RRu/, /-ththu/, /-thu/, /-i/, /-ccu/, /-Ntu/, /-ttu/
   E.g. paRanjnju (told), ezhuthi (wrote)

2. Present tense marker: /-unnu/
   E.g. varunnu (coming)

3. Debitive emphasized marker: /-aNee/
   E.g. tharaNee (should give)

3.3. Postpositions

Robert Caldwell (1913) commented that every postposition annexed to a noun constitutes, properly speaking, a new case. On the basis of this definition, postpositions are also considered here as markers. English has prepositions where Malayalam has postpositions. Postpositions are classified in different ways according to their origin and morphological similarities, for example as particles, case indicators, co-ordinators and derived nouns. Twenty-seven suffixes are identified as postpositional suffixes. Some of them are

1. Duration/distance suffix: /-ooLaaM~/
   E.g. raamanooLaaM~ (up to/till/about Raman)

2. Equality suffix: /-poole/
   E.g. kuttiyepoole (like a child)

3.4. Morphophonemic Changes in Malayalam

The suffixes obtained from the manual analysis are grouped according to their initial phonemes. Group A contains all the suffixes beginning with /-a/, Group AA contains all the suffixes beginning with /-aa/, and so on. A detailed analysis is carried out of the morphophonemic change that occurs when the final syllable of the root word concatenates with the initial phoneme of the suffix. These changes are generalized, and morphophonemic rules are derived to identify the root form of the word [13].
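The grouping step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the list of vowel symbols in INITIAL_PHONEMES and the small sample of suffixes (taken from Sections 3.1 and Table 1) are hypothetical, and the paper's own implementation is in Perl and is not reproduced here.

```python
from collections import defaultdict

# Assumed vowel symbols of the transliteration scheme; long vowels come first
# so that a suffix starting with "aa" is not filed under "a".
INITIAL_PHONEMES = ["aa", "ii", "uu", "ee", "oo", "a", "i", "u", "e", "o"]


def initial_phoneme(suffix):
    """Return the first phoneme of a suffix written as, e.g., '-aal~'."""
    body = suffix.lstrip("-")
    if not body:
        raise ValueError("empty suffix")
    for phoneme in INITIAL_PHONEMES:
        if body.startswith(phoneme):
            return phoneme
    return body[0]  # fall back to the first character (consonant-initial suffix)


def group_suffixes(suffixes):
    """Group suffixes under upper-cased names such as 'A' and 'AA'.

    Upper-casing follows the group names in the text; it is a simplification,
    because the transliteration also uses capitals for distinct phonemes.
    """
    groups = defaultdict(list)
    for suffix in suffixes:
        groups[initial_phoneme(suffix).upper()].append(suffix)
    return dict(groups)


sample = ["-aal~", "-aakatee", "-athth", "-il~", "-ileekk", "-ootU", "-e"]
groups = group_suffixes(sample)
for name in sorted(groups):
    print(f"Group {name}: {groups[name]}")
```

Grouping by initial phoneme keeps together the suffixes that meet the final syllable of a root in the same way, so the morphophonemic change at each junction only has to be analysed once per group.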

3.5. Morphophonemic Rule Implementation for Root Word Identification

A word in the corpus is analysed from right to left. The first suffix encountered is removed, and if a link morph such as /in/ or /u/ is present, it is also removed. The remaining part of the word is taken for further analysis. If the last syllable is /y/ or /v/, it is sufficient to chop it off to obtain the root form of the word. If the last syllable is /ththa/, it is converted to /-aM~/. In this way, a list of rules is available to generate the root forms of words. Most words conform to these rules, and only a limited number of exceptions are identified. Some of them are listed in Table 2. Some of these words have final syllables that resemble suffix endings. They are stored separately as exceptional words and are treated as root words.

Table 2. Examples of exceptional words

skumaaR~
bil~
kaan~
Ooroo
Koccu
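The three clean-up rules named in Section 3.5 and the exception list can be sketched as follows. This is a minimal Python sketch, not the paper's rule set: the function name clean_stem is hypothetical, only the rules quoted in the text are covered, and the inputs in the usage lines are placeholders ("X" stands for an arbitrary stem), not attested Malayalam words.

```python
# Exceptional words from Table 2 are returned unchanged.
EXCEPTIONS = {"skumaaR~", "bil~", "kaan~", "Ooroo", "Koccu"}
LINK_MORPHS = ("in", "u")  # link morphs named in the text


def clean_stem(stem):
    """Apply the clean-up rules to what is left after a suffix has been removed."""
    if stem in EXCEPTIONS:
        return stem
    # Remove one trailing link morph, if present. A real system would need
    # more context to decide whether the ending really is a link morph.
    for link in LINK_MORPHS:
        if stem.endswith(link) and len(stem) > len(link):
            stem = stem[: -len(link)]
            break
    # A final /ththa/ is converted to /aM~/.
    if stem.endswith("ththa"):
        return stem[: -len("ththa")] + "aM~"
    # A final /y/ or /v/ is chopped off.
    if stem.endswith(("y", "v")):
        return stem[:-1]
    return stem


for placeholder in ["Xththa", "Xy", "Xin", "Koccu"]:
    print(placeholder, "->", clean_stem(placeholder))
```

Checking the exception list first is what keeps a word such as Koccu from losing its final /u/ to the link-morph rule.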

The suffixes agglutinated with nouns and verbs are classified as noun suffixes, verb suffixes and dual functional suffixes (suffixes that agglutinate with both). The NS, VS and DS files shown in Figure 1 contain these suffixes. They are the auxiliary files of the Root Word Identifier system. The words from the corpus are fed to the Root Word Identifier system, and the morphophonemic rules discussed in Section 3.5 are applied to each word to obtain the root word. The system iteratively identifies the suffixes and the remaining word. At the end of the task, the root word is identified. The complete details of the words from the Malayalam corpus are stored in the Computational Lexicon. A block diagram of the system is shown in Figure 1.

Figure 1. Block diagram showing the input and output of the Root Word Identifier

3.6. Algorithm for Root Word Identifier

Begin
  Step 1: Scan the word from the right-hand side.
  Step 2: Identify the suffix S and compare S with VS.
          If present, store the word in the verb category.
  Step 3: Else, compare S with NS, DS and EX.
  Step 4: If present, store the word according to the corresponding grammatical category.
          Then remove S and apply the morphophonemic rules.
  Step 5: Repeat steps 1 to 4 until the root word is encountered.
  Step 6: Store the root word in the computational lexicon with its grammatical category.
End
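The loop in the algorithm above can be sketched in Python as follows. Everything here is an illustrative assumption rather than the paper's Perl code: the suffix lists hold only a handful of entries drawn from the text, EX is treated as a set of whole words that are never stripped, and the single morphophonemic rule in apply_rules (restoring the final ~ after L) is inferred from the worked example in Section 4.

```python
# Illustrative suffix lists only; the paper's NS, VS, DS and EX files are not reproduced.
VS = ["unnu"]                          # verb suffixes (present tense marker, Section 3.2)
NS = ["kuute", "ute", "kaL~", "il~"]   # noun-side suffixes taken from the text
DS = []                                # dual functional suffixes (empty in this sketch)
EX = {"bil~", "kaan~"}                 # exceptional words kept whole (Table 2)

SUFFIX_TABLE = [(VS, "verb"), (NS, "noun"), (DS, "dual functional")]


def find_suffix(word):
    """Return (suffix, category) for the longest listed suffix ending the word, or None.

    VS is consulted before NS and DS, following steps 2 and 3.
    """
    for suffixes, category in SUFFIX_TABLE:
        for suffix in sorted(suffixes, key=len, reverse=True):
            if word.endswith(suffix) and len(word) > len(suffix):
                return suffix, category
    return None


def apply_rules(stem):
    """Stand-in for the morphophonemic rules: restore the final '~' after 'L'."""
    return stem + "~" if stem.endswith("L") else stem


def identify_root(word):
    """Strip suffixes from the right until none matches; return (root, trace)."""
    trace = []
    current = word
    while current not in EX:
        match = find_suffix(current)
        if match is None:
            break
        suffix, category = match
        trace.append((current, category, suffix))
        current = apply_rules(current[: -len(suffix)])
    return current, trace


root, trace = identify_root("kuttikaLutekuute")
for form, category, suffix in trace:
    print(f"{form}: {category} word, suffix -{suffix}")
print(f"{root}: root word")
```

Each intermediate form is recorded before its suffix is removed, so the trace holds the inflected forms and their categories, and the loop ends with the root word. Trying the longest suffix first is a design choice of this sketch, made so that a short suffix does not pre-empt a longer one that contains it.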

[Figure 1 block labels: Malayalam Corpus, Malayalam Words, NS, VS, DS, EX, Root Word Identifier, Computational Lexicon]

4. RESULTS AND DISCUSSIONS

The work is implemented in Perl (Practical Extraction and Report Language) [14] in a Linux environment using a Unicode-supporting font.

Given a corpus as input, the Root Word Identifier system generates a computational lexicon that holds linguistic details about the words in the corpus: the root form of the word, the grammatical category of the root word, the inflected forms of the root word, all the suffixes agglutinated with the word, the name of each suffix, the words obtained after removing each suffix and their grammatical categories.

The following example shows how the word kuttikaLutekuute, meaning ‘with the children’, is processed by the Root Word Identifier.

[Word from Malayalam corpus]

കുട്ടികളുടെകൂടെ – കുട്ടി + കൾ + ഉടെ + കൂടെ

[Phonetic notation]

kuttikaLutekuute – kutti + kaL~ + Ute + kuute

[Description]

word – child + plural suffix + possessive case suffix + suffix showing internal movement

[Linguistic information]

kuttikaLutekuute – Noun word

kuttikaLute – Noun word

kuttikaL~ – Noun word

kutti – Noun word – Root word

An objective analysis of the performance of the system is carried out using the statistical measures Recall, Precision and F-measure.

1. Precision (P) = tp / (tp + fp)

2. Recall (R) = tp / (tp + fn)

3. F-measure = 2 * P * R / (P + R)

where tp is the number of true positives, fp is the number of false positives and fn is the number of false negatives [15].
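The three measures can be computed directly from the counts, as in the short Python sketch below. The function name and the counts in the usage lines are hypothetical and chosen only to make the arithmetic easy to follow; they are not the counts behind the results reported in this paper.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, f_measure) from raw counts.

    A measure whose denominator is zero is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts: 90 roots identified correctly, 10 wrongly, 30 missed.
p, r, f = precision_recall_f(tp=90, fp=10, fn=30)
print(f"P = {p:.2%}, R = {r:.2%}, F = {f:.2%}")
```

With these counts P is 0.90 and R is 0.75, and the F-measure, being the harmonic mean of the two, lies between them at about 0.82.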

Typographical forms and foreign-language words, which have little relevance to this work, are removed from the test corpus, and the remaining 23,045 words are used to test the system. The Root Word Identifier program identifies the words in the corpus and gives the root form of each word with its grammatical category, such as noun, verb or dual functional. The system achieved Precision, Recall and F-measure values of 95.42%, 95.05% and 95.22% respectively when identifying root words. Compound words and reduplicated words are stored as single root words. Only a small percentage of words were left unprocessed or were processed wrongly. The system has several advantages over existing morphological analyzers. Only a monolingual Malayalam corpus is required to develop a computational lexicon, so there is no need for a lexicon, a machine-readable dictionary or manual help in collecting words.

5. CONCLUSION

As part of developing a computational lexicon for Malayalam, there arises a need for a Root Word Identifier that identifies the root form of a word. A detailed morphological analysis is carried out to study the underlying structure of Malayalam words. Morphophonemic rules are derived to obtain the root form of a word automatically, together with its grammatical category and inflected forms. With a larger corpus, this rule-based approach can contribute much to the development of a full-fledged computational lexicon.