Computational Linguistics – News update for Nov 15, 2006

Posted by Alexandre Rafalovitch on November 15, 2006

Lots of new sightings of CL/NLP technologies since the last update:

  • On the commercial speech recognition front, Nexidia is currently in beta with phoneme-mapping audio search. But don’t go to the company’s site. Instead, read the explanation and collection of links in the ResourceShelf article.
  • If, instead of waiting for commercial offerings, you would like to contribute to an open source one, VoxForge always needs more transcribed audio recordings to improve their Command and Control acoustic models.
  • Switching from speech recognition to speech synthesis, E-health-insider has a fascinating podcast from the field (Somalia), with a practical example of how even an imperfect technology can bring tangible benefits to people in need.
  • Text generation might also soon become a more interesting topic. Indiana University recently launched The Synthetic Worlds Initiative and – as part of it – has just started the ARDEN project, which will try to produce a synthetic 3D world in the universe of William Shakespeare. They are not planning to have bots in there, but can they resist it, given that a virtual world interface and the availability of the full texts of Shakespeare’s works make it an ideal playground for advanced A.L.I.C.E. competitions?
  • If you like text classification tasks and/or machine learning, there is an Agnostic Learning vs. Prior Knowledge Challenge & Workshop. Dataset Nova is the one for text classification; there are others for different machine learning tasks. There might even be a small prize.
  • For those who only get out of bed for big(ger) prizes, there is the Second Annual CyC prize. The prize is $2,500, but to get it you must publish an academic paper that has something to do with CyC’s knowledge base of assertions about the everyday world. This may or may not be a hard task; you can judge for yourself by checking out the winners of last year’s prize. The deadline is February 21st, 2007, and some people may have had an early start, since the competition has been running since February of this year.
  • Named Entities and Semantic Web come together in the demo put together by InFact, which parsed and cross-linked public domain books into a web of names, places and relations. Just don’t try to manually change the URLs; the implementation itself is a bit brittle (the company was notified). Speaking on a more abstract level, this demo also shows the benefits of actually having unrestricted full-text access to books. I feel that public domain books are just waiting to be remixed and experimented with beyond what we see now.
  • Finally, those who missed AOL’s attempt to beat Google’s release of n-gram models by releasing, and then withdrawing, 20 million web queries that included private data can still get access to that data from multiple websites, including one with a semi-useful search interface. One wonders if the AOL executive responsible for the release decision likes proverbs, specifically the one that goes “A word spoken is past recalling”.

Learning language like children do – as if!

Posted by Alexandre Rafalovitch on November 13, 2006

I keep hearing claims that one should try learning a foreign language like children do. Rosetta Stone is a famous example of software that convinces people that they can do just that.

I have a couple of problems with that approach.

The first is that even if the immersion method were sufficient, it would have to be as immersive as what a child gets – 24 hours a day minus sleep. One hour a day is not sufficient in my opinion. And if you are studying a foreign language in an immersive environment, Rosetta Stone is just a way to concentrate your mind more than anything. And with its price tag, a very expensive way to concentrate the mind.

The other reason is that when people say immersive environment, they usually mean no grammar rules. Just listening and talking, reading and writing. That’s what children do, right?

Wrong! At least it is wrong for the Russian language. Schools in the USSR used to have a class called Russian Language, which ran for several school years. It was not about Russian literature; that was a second, separate class. Russian Language class was about learning the orthography and grammar of our own mother tongue and – trust me! – it was hard.

Declensions were hell. The Russian language has six cases and we had to have mnemonics just to remember their order (I still remember «Иван Родил Девчонку, Велел Тащить Пелёнку»). The rules for when to write the soft and hard sign letters were a story of their own. And dictations! That is when you think that the teacher’s whole purpose in life is to make you want to cry. When every misspelling and every missing comma would drop your grade! And then (the next year) you get rephrasing exercises where you listen to a story three times and have to write it out in your own words afterwards. And you are marked for style as well as orthography.

And, I am sorry to say, we made fun of Georgians and Armenians, because – trying to learn their own complex languages – they never sounded quite right speaking Russian, even though they were also part of the USSR. We learned how to say things correctly, because we had anecdotes being told and retold on exactly how they got it wrong.

I always admire people who decide to learn Russian and persevere with its alphabet, its grammar and its pronunciation. But I don’t have much time for those who think that the ‘learning like children’ approach means learning through absorption and with no grammar study. It did not work for us when we were children. I don’t see how it will work for you, however much you pay for the software with the fancy claims on its cover.

Free music and more

Posted by Alexandre Rafalovitch on November 9, 2006

Amie St is a very interesting business idea with good execution. They are a music discovery service and store with a twist – songs start free and the price goes up based on how popular they get. To encourage ratings and downloads, they even pay the users who discovered good songs early and recommended them to others. And the songs never get as expensive as on iTunes.

I like free music. I have enough music in my personal collection not to buy new tracks for a while (especially not from RIAA members), but I will listen to free songs to see if something really special catches my attention.

I have tried Pandora and liked it, but obviously those songs are not for download and cannot be repeated easily (if one cares to stay legal anyway). I have tried iTunes’ free single of the week and was very disappointed.

Amie St makes it easy to discover good songs in the genres I like and will certainly keep me coming back for more.

One feature I wish they had is the ability to subscribe to a channel (artist, genre, price range) and have it downloadable as a podcast of samples. Price range as a parameter might be useful for somebody ready to buy good songs, so they could pick them up in the 50-60 cent range and get a good deal, yet filtered by the crowd’s rating. The delivery could be either individual 30- or 45-second files or one big file with chapter marks to skip easily and with embedded info/album art. That way I could listen to the songs on the move and buy/download them later. This obviously also increases users’ stickiness to the site, as the user interacts with the site even when not at the computer.

Or maybe they just need to open an API and somebody else will do that for them.
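
For the curious, here is a rough sketch of what such a channel filter could look like if the catalogue data were available. Everything in it is my own assumption: the Song fields, the 0-5 rating scale, the channel function and the sample catalogue are hypothetical and are not anything Amie St actually exposes.

```python
from dataclasses import dataclass


@dataclass
class Song:
    # Hypothetical record; field names are assumptions, not a real API.
    title: str
    artist: str
    genre: str
    price_cents: int
    rating: float  # crowd rating, assumed to be on a 0-5 scale


def channel(songs, genre=None, artist=None,
            min_cents=0, max_cents=None, min_rating=0.0):
    """Return songs matching the channel parameters, best rated first."""
    picked = []
    for song in songs:
        if genre is not None and song.genre != genre:
            continue
        if artist is not None and song.artist != artist:
            continue
        if song.price_cents < min_cents:
            continue
        if max_cents is not None and song.price_cents > max_cents:
            continue
        if song.rating < min_rating:
            continue
        picked.append(song)
    # Highest crowd rating first, so the best deals lead the feed.
    return sorted(picked, key=lambda s: s.rating, reverse=True)


# Made-up sample data, purely for illustration.
catalog = [
    Song("Early Bird", "A", "folk", 0, 3.2),
    Song("Slow Climb", "B", "folk", 55, 4.5),
    Song("Crowd Pleaser", "C", "folk", 60, 4.1),
    Song("Overplayed", "D", "folk", 98, 4.8),
    Song("Side Street", "E", "jazz", 52, 4.7),
]

# Folk songs in the 50-60 cent range that the crowd already rates well.
deals = channel(catalog, genre="folk", min_cents=50, max_cents=60,
                min_rating=4.0)
for song in deals:
    print(f"{song.title} by {song.artist}: {song.price_cents}c, "
          f"rated {song.rating}")
```

The resulting list is what would get turned into the podcast feed of samples; generating the feed itself is a separate step and is left out here.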

(Update: Another good review of the service is at:

RSCDS website refresh

Posted by Alexandre Rafalovitch on November 7, 2006

The Royal Scottish Country Dance Society has updated their website. It now looks prettier, runs on a more modern technological base and promises more up-to-date information.

This is the next step, after the redesign of the society magazine, in moving with the times while preserving the original goals of the Society.

It is good to see the society recognising that the internet is worth putting time and effort into, especially with members and branches all over the world.

On open e-book standards and whether translating to Esperanto will bring more readers?

Posted by Alexandre Rafalovitch on November 5, 2006

There is a fight brewing between David Rothman of TeleRead and Bill Janssen of Plucker fame. The point of contention (as I understand the issue) is what would be a good format to produce e-books in.

Bill’s position is that any format that is not already accepted (specifically, anything that is not HTML) is a lock-in and a disadvantage, whether that format is an open standard (like OpenReader) or a proprietary one (like Sony’s BBeB). He advocates using web browsers as e-book readers.

David’s point (and he invokes me in there) is that the HTML format is not sufficient for all e-books, mostly due to layout issues and changes between browsers. So, if HTML is not sufficient, we have to choose a new format. Therefore, it is better if the format is an open standard that can be implemented and maintained by multiple parties.

I am with David here, and mostly for the reasons he pointed out. For my interests (language learning e-books), HTML is not a good enough format. Sure, I could hack HTML into submission for some of my goals, but it would require so much JavaScript that it would not work in anything but a full-blown browser. I invite Bill to replicate the functionality of the Pocket e-Sword so that it works well in IE, Firefox, Opera and Safari. Maybe that’s why Pepper Pad is integrating FBReader despite already having a built-in Firefox web browser.

So, where does Esperanto come into it? Well, here is Bill’s quote (emphasis is mine):

Trying to standardize on a common “ebook format”, be it some IDPF creation, some OASIS masterpiece, or even the so-called OpenReader, would only be an attempt to force them all to publish in Esperanto, instead of their house languages. They still wouldn’t have customers.

Publishing in Esperanto does not bring customers? Really! I wonder where Bill gets that data. I don’t know how many (human) languages he speaks, but the only reasonable way I could interpret that statement was as “publishing English material in Esperanto would not bring any more English customers”. That could be a point where he would be mostly correct. Of course, the market for Esperanto is not English, it is global.

As an example, I want to take the book/movie Night Watch by my favourite author Sergey Lukyanenko. The book started in Russian, was made into a Russian movie with English subtitles, made an impact on the American market and finally was translated (quite well) into English. What about the Chinese or the Egyptians? Would they be interested in this book? Maybe, but there is no easy way to find out, because translation or even subtitling is very expensive.

Except that there is a way. Night Watch has just been translated into Esperanto (announcement in Russian). There is even an excerpt available (unfortunately in PDF). Now, the book is accessible to people in China, Egypt or Germany, as long as they can read Esperanto. And if there is enough interest from those people, the book can be translated into their native languages as well, to reach the rest of the audience. The push model of finding the markets suddenly becomes a pull model of the market finding you. This is not a new idea; it is already used by newspapers and even the Vatican. It is called establishing a beachhead, I believe.

And that’s exactly the strength of open standards. They can expand the audience beyond the originally planned targets and bring new markets to your solution, adapting the solution to the market needs in the process.

Closed standards control the markets they know about; open standards create new, unplanned markets. I am currently in a market segment Sony does not want to think about. Do I wait another 5 years for Sony to catch up, or do I look for open standard and open source alternatives? There should be no need to guess.

