Opensource Fellowships and localization at SARAI day II (plus Delhi Linux User Group meeting)

I already wrote a long post about the first day of the FLOSS open source and localization fellowship presentations at SARAI; here is more from the second day.

I also wanted to mention that at the end of the first day, we watched a very interesting video of a lecture given in the US in the 1980s. An American scholar of Sanskrit began by talking about the Mahabharata - the epic poem of Hinduism and Indian culture, twice the length of the Bible - and how he was trying to do advanced analysis on it to ascertain which parts had been delivered orally, and which parts had been authored as written text. For this, it would help to have the text available on a computer, but typing in the whole thing by hand was practically impossible.

Luckily his son was a computer engineer, and working together, they had figured out a way to do OCR - text recognition - on scanned pages of the text, written in the Devanagari script (the same script used for Hindi). The computers and machines shown were so primitive that it is amazing to think of what they were able to achieve. The son was also very good at explaining the thinking behind the design of the OCR program, and although I could never have replicated it, it was very interesting to understand the principles behind it. It was also humbling to think that they had designed such a program in the 1980s, and yet today we are still struggling to produce good OCR for many Indian scripts (and we think we are working at the “cutting edge” of technology…).

Speech Recognition in Hindi

Sachin Joshi described to us the process of creating a speech recognition system for Hindi. He used Sphinx, an open source system developed at Carnegie Mellon that enables you to construct a speech recognition system in any language. He mentioned that Indian languages were difficult, because they have such a wide spread of dialects and accents. He therefore focused on Northern Hindi, and used a bit over 30 volunteer readers, both men and women in different age groups, who each read for about half an hour. They gathered a large corpus, and then used software to pick out the most “efficient” sentences: the ones that cover the most common sounds in the shortest amount of time.
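To give an idea of what picking “efficient” sentences could look like, here is a minimal sketch in Python. The talk did not go into the actual algorithm or tool, so this is only my assumption of one common approach: a greedy selection that keeps choosing the sentence which adds the most not-yet-covered sounds per character of text. The function name, the example sentences and the sound labels are all made up for illustration.

```python
def select_sentences(sentences, target_units):
    """Greedy selection (an assumed approach, not the one from the talk).

    sentences:    dict mapping sentence text -> set of sound units it contains
    target_units: set of sound units we want the recordings to cover
    Returns the chosen sentences and any units that could not be covered.
    """
    remaining = set(target_units)
    candidates = dict(sentences)
    chosen = []
    while remaining and candidates:
        # Score each candidate by newly covered units per character of text.
        best_text, best_units = max(
            candidates.items(),
            key=lambda item: len(item[1] & remaining) / max(len(item[0]), 1),
        )
        if not (best_units & remaining):
            break  # no remaining sentence covers anything new
        chosen.append(best_text)
        remaining -= best_units
        del candidates[best_text]
    return chosen, remaining


# Hypothetical toy corpus with made-up sound labels.
corpus = {
    "sentence a": {"k", "a", "m"},
    "sentence b": {"k", "a"},
    "sentence c": {"t", "i"},
}
chosen, missing = select_sentences(corpus, {"k", "a", "m", "t", "i"})
print(chosen)   # ['sentence a', 'sentence c']
print(missing)  # set()
```

A greedy pass like this does not guarantee the smallest possible set of sentences, but it is simple and usually gets close.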

He then gave us a demonstration, using a small script that would record a short stretch of speech, process it, and then output the recognized text in Hindi. When he spoke, using some of the example sentences, it worked fairly well, but when some of the volunteers tried other sentences, it was not as accurate. However, it is early in development, and it can be improved quite a bit. It was already impressive, and many of the participants were very excited about the prospects.

What is also exciting is that the project provides the acoustic model - which is the hardest part to create - as an API for any other project to build upon. Others can treat it as a black box, add a linguistic model (which contains information about what a person can be expected to say), and build all kinds of applications on top of that, from directory lookup, to train reservation systems, to computer interfaces for illiterate people.
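As a rough illustration of how the two pieces fit together, here is a small Python sketch. It is not the Sphinx API and not how the project's code works; it only assumes that the acoustic model hands back a log score for each candidate transcript, and that the linguistic model is a table of how likely each phrase is in a given application. All names, phrases and numbers are hypothetical.

```python
import math


def best_hypothesis(acoustic_scores, language_model, lm_weight=1.0):
    """Pick the candidate with the highest combined log score.

    acoustic_scores: dict of candidate transcript -> log score from the
                     acoustic model (treated as a black box here)
    language_model:  dict of phrase -> probability for this application
    """
    def combined(text):
        # Unseen phrases get a tiny floor probability instead of zero.
        lm_prob = language_model.get(text, 1e-9)
        return acoustic_scores[text] + lm_weight * math.log(lm_prob)

    return max(acoustic_scores, key=combined)


# Hypothetical candidates for a train reservation application.
acoustic = {"ticket book karo": -11.0, "ticket buk karu": -10.5}
reservation_phrases = {"ticket book karo": 0.2}
print(best_hypothesis(acoustic, reservation_phrases))  # ticket book karo
```

The second candidate sounds slightly closer to the audio, but the linguistic model knows it is not something a user of this application would say, so the first one wins. Swapping in a different phrase table is what would turn the same acoustic model into a directory lookup or some other application.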

There was also a discussion about how to find good corpora in Hindi that are open source, and that reflect modern spoken Hindi.

Urdu localization

Sawood Alam told us about his project to localize GNOME 2.2 into Urdu. He discussed some of the terms that he had chosen (for me it was amusing to hear several words that were similar to Indonesian, for example terjemah, translate, which is identical in Indonesian). He also discussed several challenges with Urdu language computing and localization, such as input methods and different fonts. After a good amount of practice, he reached a translation speed of 40 strings per hour. He said it was more difficult to do localization in Urdu, because unlike for the other Indic languages, there had not been a government committee that had defined a technical terminology. However, the people in the audience were adamant that he was in fact lucky to be able to start from scratch, since the terminology resulting from the government committees was in any case too archaic and foreign to users.

Finally, some problems with mixing right-to-left and left-to-right input were discussed, for example when inputting placeholders like %s into the text strings, where the Urdu text goes from right to left but the %s goes from left to right. Apparently this is also a problem when providing domain names in the Urdu script, as we were told by a person from the Indian Internet Exchange.
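To make the direction problem a bit more concrete, here is a small Python sketch. It is not something shown in the talk; it uses the standard unicodedata module to show that the characters in %s belong to different direction classes than Urdu letters, and then wraps each placeholder in Unicode directional isolate characters, which is one way (my assumption, not the speaker's solution) to keep whatever gets substituted from being reordered together with the surrounding text. The Urdu string and the function names are made up for illustration.

```python
import re
import unicodedata

FSI = "\u2068"  # FIRST STRONG ISOLATE
PDI = "\u2069"  # POP DIRECTIONAL ISOLATE

# Matches simple printf-style placeholders such as %s, %d and %1$s.
PLACEHOLDER = re.compile(r"%(?:\d+\$)?[sd]")


def direction_classes(text):
    """Return the Unicode bidirectional class of every character."""
    return [unicodedata.bidirectional(ch) for ch in text]


def isolate_placeholders(translated):
    """Wrap each placeholder in FSI ... PDI so it is laid out on its own."""
    return PLACEHOLDER.sub(lambda m: FSI + m.group(0) + PDI, translated)


urdu_message = "فائل %s محفوظ ہو گئی"  # hypothetical translated string
safe_message = isolate_placeholders(urdu_message)

print(direction_classes("%s"))       # ['ET', 'L']
print(direction_classes("فائل")[0])  # AL
print(safe_message % "report.txt")
```

The placeholder is still a normal %s, so the program can fill it in as usual. Whether the result is actually displayed correctly depends on how well the toolkit and the fonts support these control characters.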

Meeting of Delhi Linux User Group

At the end of the day, there was a meeting of the Delhi Linux User Group. Gora gave the introduction, talking a bit about the upcoming elections, and hoping that they would attract more young and new faces. Then a guy from OSSCamp.in came on to explain their concept. They organize “un-conference” style gatherings about twice a year, where anyone can come to give a presentation, and people learn from each other. They pride themselves on reducing the distance between professionals and amateurs, and on providing a relaxed forum where people can interact. They often have between 50 and 70 participants, mostly students, some even from high schools. They also held a mobile camp (about mobile applications, hardware etc.) in Mumbai, which was a huge success.

There were many questions from the audience, and some suggestions. Overall I thought the audience was a bit too critical of their concept, and not appreciative enough of the fact that many different kinds of events are needed for different people. Indeed, I would have loved to attend the one that happened at the end of June - had I known about it. This led to a short discussion about the feasibility of having a portal of community events in the FOSS world.

All in all, it was a wonderful gathering. I learnt a huge amount about localization, and the specific challenges that different Indic languages and scripts present, and also about the good work being done. I also met some very interesting people that I hope to stay in touch with. I wish the presentations had been advertised better, because I think they might have been interesting to more than the few people who came (mostly the presenters themselves). But hopefully through these writeups, at least people will know what happened - and what is going on. If you want more information about any of these projects, or you want to apply for a Sarai FLOSS fellowship (they will be having more), then contact gora at sarai dot net.

Stian
