Nice work! If I'm not mistaken, finding the root requires morphological disambiguation, since the correct analysis may change depending on the context/phrase in which the word is observed.
This is an active area of research in Morphologically Rich Languages (MRLs), since this problem also appears in other Semitic languages like Hebrew, as well as in Turkish. There's a nice body of work to learn from, both with and without neural nets. For example, this paper from 2017 (http://aclweb.org/anthology/D17-1073) uses a neural model for morphological disambiguation. You can see a nice comparison of tools in the recent 2018 Universal Dependencies Shared Task results: http://universaldependencies.org/conll18/results-lemmas.html (look for ar_padt).
If you're looking for training data, the Arabic treebanks in http://universaldependencies.org could help. I think some of them contain surface tokens with lemmas. I'm quite sure they also have roots.
This is exciting to see. I am a Semitic philologist (Ph.D.) now breaking into the IT industry, and this sort of work is on my radar, though mostly with Hebrew and Aramaic. Arabic, being a Semitic language, has a non-linear morphology, which means that finding the root requires extracting non-inflectional consonants from all possible positions in a word. If you train a NN on full conjugation paradigms over a data set, it should be able to begin to recognize what the various inflectional morphemes are. In other words, instead of looking for the root, look for everything that is not the root, and the root is what is left over. For example, the NN should be able to recognize that mu-, ya-, ta-, 'āC-, -ā-, -Ct-, -unna, etc. are all inflectional morphemes. It should also begin to recognize the various matres lectionis, or letters indicating long vowels, such as alif, waw, and ha. (I'm including vowels in my analysis because I think like a philologist, not a typical reader of Arabic. Using unvowelled text might be more difficult for the NN.) Anyway, these are just some off-the-cuff thoughts. I look forward to digging deeper into your code and methodology sometime soon.
Thanks, that's awesome! I am a software engineer and long-time student of Arabic. You're pretty much on the mark with the capabilities of the model at this point. It can recognize the simple morphemes and long vowels but stumbles on more complex constructions. Definitely ping me on GitHub if you have any questions about anything in the repo or if you just want to talk shop about linguistics / data science.
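For anyone who wants to see the "everything that is not the root" idea in miniature, here is a rule-based sketch in Python, assuming unvowelled input and toy affix lists (the affixes and the greedy stripping order are illustrative only, not how the NN or the repo works):

    # Toy sketch: peel off known inflectional morphemes; what remains is a
    # root candidate. The affix lists are tiny illustrative samples.
    PREFIXES = ["مت", "م", "ي", "ت", "ا"]   # e.g. muta-, mu-, ya-, ta-, 'a-
    SUFFIXES = ["ون", "ات", "ان", "ة"]      # common endings: -un, -at, -an, -a
    LONG_VOWELS = set("اوي")                # alif, waw, ya as vowel letters

    def root_candidate(word: str) -> str:
        """Strip one prefix and one suffix, then drop internal long vowels
        until only three letters remain."""
        for p in sorted(PREFIXES, key=len, reverse=True):
            if word.startswith(p) and len(word) - len(p) >= 3:
                word = word[len(p):]
                break
        for s in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(s) and len(word) - len(s) >= 3:
                word = word[:-len(s)]
                break
        while len(word) > 3 and any(c in LONG_VOWELS for c in word[1:-1]):
            i = next(i for i, c in enumerate(word[1:-1], 1) if c in LONG_VOWELS)
            word = word[:i] + word[i + 1:]
        return word

    print(root_candidate("متوسط"))  # -> وسط

Greedy stripping like this is exactly where hand-written rules start to break down (a root letter can look like an affix), which is part of the appeal of learning the morphemes from data instead.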
This is an interesting project, although as others have mentioned, the space of Arabic words is (reasonably) bounded, and an explicit parsing approach or something that uses known language data might prove to be more efficient and accurate.
Along those lines, I might be able to provide some useful JSON (the basis for http://www.arabicreference.com) in case you are interested. I've been meaning to do some fun investigations using this data (e.g. predicting broken plurals, masadir, form I internal vowelling) but haven't yet had time.
Wow, really nice app! I love the simple-yet-highly-responsive interface. I'm definitely going to be sharing this around to my translator friends. Just curious, are you a translator?
This is really cool. Interestingly, Bayyinah Institute (www.bayyinah.com) has a site [1] for the opposite problem, where you can generate the 10 morphological families from root letters.
Some languages, like Arabic, have "roots" which are used to form words. These roots usually have an abstract or ambiguous meaning. You can take those roots and turn them into words with a defined meaning by using a form, kind of like a template. So one root, for example, is BDL; its rough meaning is to exchange or replace something. One template you can use is TaXXeeX, with the root letters going in place of the Xs. This results in the word "tabdeel", which means "an exchange".
So what the NN does is take a word as input and try to find its root. "Tabdeel" was the first input listed, and the output was "BDL".
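To make the template mechanics concrete, here is a toy Python version of the example above, in transliteration (this illustrates the concept only; the NN learns the mapping rather than applying templates):

    # Fill the template's X slots with root consonants: BDL + taXXeeX -> tabdeel.
    def apply_template(root: str, template: str) -> str:
        """Replace each X in the template with the next root letter."""
        letters = iter(root.lower())
        return "".join(next(letters) if ch == "X" else ch for ch in template)

    print(apply_template("BDL", "taXXeeX"))  # -> "tabdeel" ("an exchange")

The NN solves the inverse problem: given "tabdeel", recover "BDL" without being told which template was used.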
Oops, I didn't quite understand the question. Yep, inputs are Arabic words represented as Unicode strings, with or without inflection marks (e.g. متوسّط), and the output is a string consisting of 3 or 4 Arabic letters (وسط) or an empty string for "no root".
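A side note on the "with or without inflection marks" part: the short-vowel signs and shadda are Unicode combining marks, so they can be stripped in a normalization step. A minimal sketch (my own illustration, not necessarily what the repo does):

    import unicodedata

    def strip_diacritics(word: str) -> str:
        """Drop Arabic harakat, shadda, etc., which are combining marks (Mn)."""
        decomposed = unicodedata.normalize("NFD", word)
        return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

    print(strip_diacritics("متوسّط"))  # -> متوسط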
No problem. By the way, I added some quick and dirty romanization in the output stage if you're interested. It in no way represents how the words sound, but it does make it easier for the Roman eye to parse.
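A letter-for-letter mapping of that quick-and-dirty kind might look like the sketch below; the exact scheme here is my own guess (loosely Buckwalter-flavored), not necessarily the one in the repo:

    # Rough consonant transliteration table; unknown characters pass through.
    ROMAN = {
        "ا": "a", "ب": "b", "ت": "t", "ث": "th", "ج": "j", "ح": "H",
        "خ": "kh", "د": "d", "ذ": "dh", "ر": "r", "ز": "z", "س": "s",
        "ش": "sh", "ص": "S", "ض": "D", "ط": "T", "ظ": "Z", "ع": "3",
        "غ": "gh", "ف": "f", "ق": "q", "ك": "k", "ل": "l", "م": "m",
        "ن": "n", "ه": "h", "و": "w", "ي": "y", "ء": "'",
    }

    def romanize(word: str) -> str:
        return "".join(ROMAN.get(c, c) for c in word)

    print(romanize("وسط"))  # -> "wsT"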
I took some Arabic at university. It's a fascinating language. My impression was that the morphology is quite regular; I wonder how complicated an old-fashioned, hand-written parser with comparable accuracy would be.
It should be possible, because Arabic morphology follows a logical set of rules (mostly). For example, given a word like متوسط, you could match it against the standard conjugations for the 10 standard verb forms, and it would match form 5 (متفعل), giving و-س-ط as the root. Verb form + conjugation would probably get you 50% of the way there (depending on how you count the number of valid words...) and I wouldn't be surprised if it's possible with just a regex. It would get a little harder once you go past the 10 verb forms and into plurals and adjective forms, which are usually shorter words but a little less regular in their construction. It seems like it would be cumbersome to catch all those forms algorithmically. But someone has probably taken the time to do it.
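As a proof of concept of the regex idea, here is a hedged sketch with just two of the many needed patterns, using capture groups for the root slots (toy code for unvocalized input only, not a complete analyzer):

    import re

    # Each pattern becomes a regex; the (.) groups capture root letters.
    FORM_PATTERNS = [
        ("form 5 participle, متفعل", re.compile(r"^مت(.)(.)(.)$")),
        ("form 2 verbal noun, تفعيل", re.compile(r"^ت(.)(.)ي(.)$")),
    ]

    def match_root(word: str):
        """Return (form name, root) for the first matching pattern, else None."""
        for name, pattern in FORM_PATTERNS:
            m = pattern.match(word)
            if m:
                return name, "-".join(m.groups())
        return None

    print(match_root("متوسط"))  # -> ('form 5 participle, متفعل', 'و-س-ط')

The catch is ambiguity: many surface forms match more than one pattern, which is where a disambiguation model (or a lexicon lookup) earns its keep.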
Knowing Arabic, I'm not very surprised by this result. I actually don't think there is an underlying pattern; most of the words are memorized and created by social convention.
It's been a while since I read up on this, but as I remember the (Western) description is that there are the roots and the derived forms (which are numbered one to ten/twelve), and then for each derived form there are one or more patterns corresponding to a word class.
So the root d-r-s has derived verbs darasa and darrasa, and to each of those correspond, say, one or more patterns for the verbal noun. But I don't think there is exactly one pattern for verbal nouns derived from the form 2 verb (e.g. from darrasa we get tadris, as I recall, but not all verbs that go like fa33ala will necessarily have a masdar of the form taf3il, right?).
You're right of course: even though the forms have prototypical systematic semantic variation (like form 2 usually being a causative, "to teach" versus "to learn"), it's not predictable which derived forms of a given root enter into actual usage or with which exact meaning, and the patterns obviously predict a lot of words that don't actually exist. And of course Arabic speakers learn words just like speakers of any other language.
> and the patterns obviously predict a lot of words that don't actually exist
I think I remember that there are a handful of cases where speakers started using some of the previously unattested forms in modern times to refer to new concepts... is that right?
Well, new words have to come from somewhere... new roots can be introduced (and be made to act like Arabic ones, like how the plural of film is aflam), but the derivation patterns are also productive, like in any other language. As an example, though the loan word 'computer' is apparently common, there's also the word 'hasuub', which I learned in my Modern Standard/Media Arabic course. The exact form is not listed in Wehr's 1968 dictionary, so maybe it wasn't used at that time, but it is straightforwardly derived from a root with the same meaning as the English 'to compute' and has the form fa3uul, which (according to Wolfdietrich Fischer's Grammar of Classical Arabic) is an 'abstract or verbal noun', so basically a calque of the English.
It really depends on what dialects of Arabic you are looking at and what your goals are. My starting dataset is very small. I study classical Arabic, so I am only looking at fus-ha or MSA (not quite the same but very close), and I only pulled a small subset of the total number of words because this started out as just a toy project. The total number of fus-ha words in use today is probably in the low millions. But if you extend to all dialects of Arabic and all time periods then you may reach half a billion to a billion words. If you go from written Arabic to spoken Arabic, then there's no telling how big the input set is. Practically infinite at that point.
For the purpose of just using the language, the morphological rules are well-understood. One of the most popular dictionaries (Hans Wehr) is arranged not in alphabetical order, but by root. And there are many online lexicons with morphological metadata as listed in other comments here. So you're right, machine learning is not necessary here (but that's not to say it couldn't be done). This is mainly just for fun and learning.
Now, if you were to add all dialects of Arabic, then you may have a use case for ML...
In my experience, any rules-based approach in linguistics may solve the problem at first, but it won't scale for long (it has been tried before; most of old NLP was very rules-based, I would say) due to the many intricacies of languages, corner cases, and non-obvious exceptions. It could work, and in fact it does, for simple scenarios, but those are not very real-world ones IMHO. I believe a neural model of such cases will, over time, yield better results. I am very skeptical these days about doing computational linguistics without considering some level of AI (so to speak).
The NN itself doesn't seem that deep :) Any particular reason for that? It's pretty hard to imagine that such a shallow net can build a model of the problem domain.
Edit: thanks for the response; it's an interesting result that more layers don't improve the situation.
I think the formatting of the output is a bit confusing. The input is jumla جملة (sentence) and the output is J-M-L or ج-م-ل, not the word جمل, which, yes, is a word for camel. It would be easier to read if I just added dashes between each output letter to emphasize that it is not a word but a collection of letters.
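That tweak would be a one-liner in the output-formatting stage, something like:

    def format_root(letters: str) -> str:
        """Join predicted root letters with dashes so جمل reads as ج-م-ل."""
        return "-".join(letters)

    print(format_root("جمل"))  # -> ج-م-ل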
Great work! Are there any plans for Latin character support? I'm Turkish and I'd love to be able to play with the words we've borrowed from Arabic, but unfortunately I don't know the Arabic script.
I added some basic transliteration in the output formatting stage so people who don't know Arabic script can understand the results. Accepting romanized input would require a little more work, so I'd have to tinker a bit there.
To make it work for loanwords from Arabic, a consonant correspondence table would need to be added, because the target language may have fewer consonants than Arabic.
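To illustrate the many-to-one problem, here is a hedged sketch that expands a romanized Turkish spelling into candidate Arabic spellings (the mapping covers only a few correspondences and is illustrative only):

    from itertools import product

    # Turkish collapsed several Arabic consonants into single letters, so one
    # romanized input expands to multiple candidate Arabic spellings.
    TURKISH_TO_ARABIC = {
        "s": ["س", "ص", "ث"],
        "z": ["ز", "ض", "ظ", "ذ"],
        "t": ["ت", "ط"],
        "h": ["ه", "ح", "خ"],
        "k": ["ك", "ق"],
        "b": ["ب"],
    }

    def arabic_candidates(word: str):
        """Expand each Latin letter into its possible Arabic sources."""
        options = [TURKISH_TO_ARABIC.get(c, [c]) for c in word]
        return ["".join(combo) for combo in product(*options)]

    print(arabic_candidates("ktb"))  # -> ['كتب', 'كطب', 'قتب', 'قطب']

Each candidate could then be fed to the model, keeping whichever one yields a known root.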
It's always written cursive, i.e. the letters are joined together. The letters represent sounds, and a letter has the same sound regardless of context. Vowels are divided into long and short ones, and the short ones aren't written. E.g. the word for "(he) wrote" is pronounced kataba, with three short a-sounds, but is written ktb.
You can write out the short vowels but it's only done in special contexts (like children's books, books for foreign language learners, or the Qur'an). It's easy enough to read if you know the language, but it makes it a little harder to learn.
It's actually not very different. Hebrew works the same way, except it's not cursive. They share a common ancestor in the Phoenician script, I think, which was also used for a Semitic language, for which this system works well. The Greek alphabet (and thence ours) was derived from the Phoenician script, which worked in a similar fashion, by adding vowels.
It is a writing system found in Eurasia. Hit its Wikipedia page; on the right-hand side there'll be a little box that lists 'parent systems', and you can follow it back to Phoenician, which also happens to be an ancestor of the Latin alphabet.
Also, you might want to take a look at the SIGMORPHON CoNLL shared tasks (2017: https://sites.google.com/view/conll-sigmorphon2017/ and 2018: https://sigmorphon.github.io/sharedtasks/2018/) on morphological reinflection, which IIRC is a similar task: taking an inflected form and reinflecting it with other morphological properties. They also have a nice data set to train on.