Always Learning!

The world through the prism of my mind

Archive for the ‘Computational Linguistics’ Category

Computational Linguistics – News update for Nov 15, 2006

Posted by Alexandre Rafalovitch on November 15, 2006

Lots of new sightings of CL/NLP technologies since the last update:

  • On the commercial speech recognition front, Nexidia is currently in beta with phoneme-mapping audio search. But don’t go to the company’s site; instead, read the explanation and collection of links in ResourceShelf’s article.
  • If, instead of waiting for commercial offerings, you would like to contribute to an open-source one, VoxForge always needs more transcribed audio recordings to improve their Command and Control acoustic models.
  • Switching from speech recognition to speech synthesis, E-health-insider has a fascinating podcast from the field (Somalia), with a practical example of how even an imperfect technology can bring tangible benefits to people in need.
  • Text generation might also soon become a more interesting topic. Indiana University recently launched The Synthetic Worlds Initiative and – as part of it – has just started the ARDEN project, which will try to produce a synthetic 3D world set in the universe of William Shakespeare. They are not planning to have bots in there, but can they resist, given that a virtual-world interface and the availability of the full texts of Shakespeare’s works would make it an ideal playground for advanced A.L.I.C.E.-style competitions?
  • If you like text classification tasks and/or machine learning, there is the Agnostic Learning vs. Prior Knowledge Challenge & Workshop. The Nova dataset is the one for text classification; there are others for different machine learning tasks. There might even be a small prize.
  • For those who only get out of bed for big(ger) prizes, there is the Second Annual Cyc Prize. The prize is $2,500, but to get it you must publish an academic paper that has something to do with Cyc’s knowledge base of assertions about the everyday world. This may or may not be a hard task; you can judge for yourself by checking out the winners of last year’s prize. The deadline is February 21st, 2007, and some people may have had an early start, since the competition has been running since February of this year.
  • Named entities and the Semantic Web come together in a demo put together by InFact, which parsed and cross-linked public domain books into a web of names, places and relations. Just don’t try to manually change the URLs; the implementation itself is a bit brittle (the company has been notified). On a more abstract level, this demo also shows the benefits of having unrestricted full-text access to books. I feel that public domain books are just waiting to be remixed and experimented with beyond what we see now.
  • Finally, those who missed AOL’s attempt to beat Google’s release of n-gram models – by releasing, and then withdrawing, 20 million web queries that included private data – can still get access to that data from multiple websites, including one with a semi-useful search interface. One wonders if the AOL executive responsible for the release decision likes proverbs, specifically the one that goes “A word spoken is past recalling”.

Posted in Computational Linguistics | Leave a Comment »

Lirix – computational linguistics aspects

Posted by Alexandre Rafalovitch on November 2, 2006

In my last update on applied computational linguistics, I wrote about PodZinger, which uses speech recognition to figure out which advertisement to match to the podcast you are searching with their service.

Another company – Lirix – is claiming to do the same with songs. Their upcoming AdLirix platform is supposed to be so effective that Lirix would be able to give away songs for free and make back the income by embedding well-targeted advertisements.

The devil, of course, is in the details – many songs carry so little meaning that it can be a challenge to figure out what they are about even manually, never mind automatically and at the volume required to fill an attractively large catalog.

Their DEMOfall presentation did not go into that level of detail, so I emailed some questions to the Lirix people directly. They promptly replied with an example:

…, here’s a lyrical excerpt from a hip-hop song named “How We Do” by a rapper named “The Game”. (This song was a big radio hit last year.)

“I put Lamborghini doors on the Escalade
Low-pro so it looks like I’m riding on blades”

In this case, we would tag the specific words “Lamborghini” and “Escalade”, the phrase “low profile”, and the themes “high-end automotive”, “after-market automotive”, and “bling”.

This looks quite advanced – if the algorithm uses genuinely computational methods. Unfortunately, I have my doubts that it does.

I can see how Lamborghini could be matched to the high-end automotive theme (named entity recognition, clustering, or even a database lookup). What I have no idea about is how they would also connect the lines above to after-market automotive.
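
To make my doubts concrete: the database-lookup part is easy to sketch, and everything beyond it is where the hard work hides. Below is a minimal, purely illustrative gazetteer lookup in Python – the entries and themes are my own invention, and nothing here reflects how AdLirix actually works:

```python
# Hypothetical sketch of gazetteer-based theme tagging for lyrics.
# All entries are invented for illustration; Lirix's method is unknown.

GAZETTEER = {
    "lamborghini": {"high-end automotive"},
    "escalade": {"high-end automotive"},
    "blades": {"after-market automotive"},  # slang for chrome rims - the hard part
}

def tag_themes(lyric: str) -> set[str]:
    """Return the set of ad themes triggered by words in a lyric line."""
    themes = set()
    for word in lyric.lower().split():
        themes |= GAZETTEER.get(word.strip(".,!?\"'"), set())
    return themes

print(tag_themes("I put Lamborghini doors on the Escalade"))
# {'high-end automotive'}
```

The catch is the last gazetteer entry: knowing that “blades” means wheel rims is slang knowledge that no off-the-shelf named entity recogniser supplies.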

I suspect that, behind the scenes, Lirix will be doing a lot of manual categorisation. I asked my contact about this issue and got a reply that effectively said “good question – no answers at this stage”. Fair enough. If they can do it automatically, they have a strong competitive advantage; if they cannot, it may mean they cannot scale fast. Either way, they may have a reason to keep quiet for now.

We will wait and see. I imagine the competition for making money from ‘free(ish)’ songs is heating up. Many techniques will be tried, and Natural Language Processing algorithms may prove to be important for a successful business.

Posted in Computational Linguistics | Leave a Comment »

There! are the blogs of computational linguists

Posted by Alexandre Rafalovitch on November 1, 2006

Nine months ago, I asked “Where are the blogs of computational linguists?” Now there is an answer.

The Association for Computational Linguistics has moved its documents (formerly the ACL Universe) into a wiki, and there is now a separate page for blogs. It has all of the blogs I have found so far, and more. It even has my blog in it. Must be scraping the bottom of the barrel 🙂 . I don’t know why more CL people are not blogging.

The best news, however, is that – being a wiki – it can be updated by anybody. So if you have a blog about computational linguistics, add yourself in there. This is not Wikipedia; (modest) self-promotion is allowed.

You can also register and add the pages you are interested in to your watchlist page, though at this point there is no RSS or email notification when a change occurs.

Posted in Computational Linguistics | Leave a Comment »

Computational Linguistics – News update for Oct 9, 2006

Posted by Alexandre Rafalovitch on October 9, 2006

A couple of interesting things have happened recently in Computational Linguistics-related fields that I thought were worth linking to:

  • ACM Queue had an interview with Mike Cohen of Google (previously of Nuance Communications) discussing recent advances and changes in speech recognition technology.
  • Pluggd, with its hotly discussed demo of HearHere, uses speech recognition and some sort of topic clustering to show a time heatmap of your search keyword inside a podcast. The idea is that the heatmap lets you skip straight to the discussion of the topic you are interested in and ignore the parts irrelevant to your interest (and the adverts); a sketch of the heatmap idea follows after this list. They have a short presentation about the product in the DEMOfall archives. Warning: sometimes it takes a couple of tries to get the DEMO video to play (depending on system load).
  • PodZinger, which already used speech recognition to find search terms within podcasts, has just added an advertising platform based on classifying by content and search term.
  • Netflix has created a challenge where they provide recommendation data so that other people can try developing an algorithm better than the one from Netflix’s own data mining team. With the big prize of a million dollars ($1,000,000), there are already a lot of competitors. While the dataset provided only has movie titles – not enough for any text/description analysis – it is still a huge dataset to try various graph and neural network methods on. Most people suggest mashing it up with IMDb or some other movie information database, but that obviously requires additional data-matching work.
  • ClearForest, on the other hand, is only offering $2,000 in their competition, and you have to bring your own data, but at least they provide an API that does named entity recognition. It beats having to load up GATE every time and, who knows, maybe somebody will create another Gutenkarte-style mashup.
  • And to finish on a funny note, maybe you would like this joke generated by STANDUP (popularised writeup): What do you get when you cross a car with a sandwich? A traffic jam.
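
A quick aside on the HearHere heatmap mentioned above. Pluggd has not published how it works, but once a speech recogniser has produced a transcript with word timestamps, the heatmap itself is just occurrence counts binned over time. A minimal sketch, with all names and data invented by me:

```python
# Hypothetical sketch of a HearHere-style keyword heatmap:
# bin keyword hits from a timestamped ASR transcript into intervals.

def keyword_heatmap(transcript, keyword, duration_s, bins=20):
    """transcript: list of (seconds, word) pairs; returns hit counts per bin."""
    counts = [0] * bins
    for seconds, word in transcript:
        index = min(int(seconds / duration_s * bins), bins - 1)
        if word.lower() == keyword.lower():
            counts[index] += 1
    return counts

# A 20-minute podcast where "football" comes up twice:
transcript = [(12.0, "football"), (14.5, "season"), (610.2, "football")]
print(keyword_heatmap(transcript, "football", duration_s=1200.0))
# [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```

Hot bins tell the listener where to jump; the topic clustering Pluggd mentions would presumably group hits by topic rather than by exact keyword.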

Posted in Computational Linguistics | 3 Comments »

How e-books could revolutionize language-learning

Posted by Alexandre Rafalovitch on October 8, 2006

[This article also appears in a slightly edited form as a TeleRead entry]

Ever tried learning a foreign language? Noticed how the books you could read were often boring, and the books you wanted to read were just that bit too hard to understand? Wished you could have a quick translation of a complex passage, or the precise meaning of a word out of the spread of twenty that the dictionary entry offers?

With paper books, you are pretty much stuck. E-books, on the other hand – with the right combination of software and open formats – may soon prove to be just the solution to keep you reading and learning in a new language. And, with the language-learning market attracting billions of dollars, you can be sure somebody will find a way to make the best of the possibilities offered by e-books.

So, what advantages can e-books bring to language learning? Let’s start with what is achievable today and progress to the possibilities further down the line.

  1. Parallel texts – Intermediate and advanced readers appreciate being able to read the original text while still having a good translation available a glance away. Paper books like this do exist, but only a few, due to the high cost of production and a widely dispersed target market. For e-books, the ever-decreasing price of storage makes the size of the download irrelevant – slashing the cost of physical production. And with electronic distribution, the market reach is as wide as the internet itself. If you still have doubts, this model is already being exploited extremely well in another multi-lingual market – Bible study. There are many Bible translations, and scholars like to be able to read them side by side to understand the deeper meaning. A number of free e-readers exist to make this task easier, including a portable one for the PocketPC.
  2. Dictionary bundling – Continuing the theme of practically unlimited storage, we can easily imagine a book being bundled with a look-up dictionary capable of providing a translation of every word and expression in the text. This is only possible with specially adapted texts at the moment and, even then, only some words and basic phrases are covered. With e-books, it would be possible to embed invisible hints that show the specific meaning of a phrase in the exact context of its paragraph (a minimal sketch of such hints follows after this list). Again, something similar has been done for Bible study with Strong’s numbers, but, with good dictionaries, the concept can be extended to any text. Many of the current e-book readers already allow dictionary lookups, so the basic functionality is available.
  3. Grammar learning through real examples – Most of the texts provided for learning grammar are boring and feel artificial. How about being able to choose your own text and have the reader software automatically highlight the structures you are learning this week, whether it is colour names, present perfect constructions or conjugations of the irregular verb ‘to be’ (a sketch of such highlighting follows after this list)? With the material presented completely in context, the rules will be easier to understand and recall. And even if you are rereading last week’s passage, you are learning something new, as the highlighted parts will change.
  4. Automatic text leveling – If we can bundle additional text that does not normally show up in the book, why can’t we have the same text several times at different levels of reading difficulty? That way, a book may contain adapted/simplified text as well as the original one. Then, any number of combinations might be used, depending on whether the reader is connected to other systems or not. For example, if the book is delivered as serialised chapters from a website, there might be grading tests in between chapters, with the simplification level of the next chapter adjusted automatically based on the test results. Or it could be a Choose Your Own Adventure type of book, where the language level of the ‘next’ page depends on the language introduced in the pages seen before.

    This is not something that can be done economically at all for printed books, since they are frozen at the time of publishing. With e-books, the material can be reused in smaller chunks across multiple learners, which allows for a viable publishing model – especially if we can throw the increasingly ubiquitous wireless broadband connection into the mix, with pages arriving in near-real-time directly on the reader device.

  5. Text generation – If a ‘Choose Your Own Adventure’ format is suitable for language learning, why not a computer-generated book altogether? It is nearly possible with current technologies to create a book as a concept graph and have the computer generate the actual text. With a bit more design, the generated text could purposefully incorporate new words and grammatical structures to match the learner’s progression through the learning plan. It would even be possible to dynamically generate grammatical explanations, since the text-creation system has to figure all that out anyway to generate the sentences. Again, combined with feedback from external or on-device tests and near-real-time downloads, the text can always stay just ahead of the learner’s own language knowledge.
  6. Finally, with e-book texts being available in open electronic formats, it would be possible to use them to generate additional contexts (such as geographic mapping) or tests based on the exact texts the learner was reading for his or her own pleasure.
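
As promised, here is what the embedded dictionary hints from item 2 could look like. This is a minimal sketch in Python, with an invented format and invented example data – no existing e-book standard is implied. The German word ‘Läufer’ is a good test case, since it can mean ‘runner’, ‘rug’ or ‘bishop (chess)’ depending on context:

```python
# Hypothetical format for context-specific glosses bundled with an e-book.
# Each hint pins a word in this paragraph to the single sense that fits
# its context, instead of making the learner pick from twenty senses.

page = {
    "text": "Der Läufer stand still auf dem Brett.",  # chess context
    "hints": {
        # word -> sense chosen for THIS paragraph
        "Läufer": "bishop (the chess piece)",   # not "runner" or "rug"
        "Brett": "board (the chess board)",     # not "plank"
    },
}

def lookup(page: dict, word: str) -> str:
    """Reader-side lookup: prefer the embedded hint over the full dictionary."""
    return page["hints"].get(word, "fall back to the bundled general dictionary")

print(lookup(page, "Läufer"))  # bishop (the chess piece)
```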
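And here is the grammar highlighting from item 3, reduced to its crudest possible form. A real reader would use a proper part-of-speech tagger; this sketch just uses a regular expression for a handful of English present perfect forms (the pattern and examples are mine):

```python
# Hypothetical sketch: highlight present perfect constructions
# ("have/has" + past participle) in any text the learner chooses.
import re

# Crude pattern covering regular "-ed" participles plus a few irregulars.
PRESENT_PERFECT = re.compile(
    r"\b(have|has)\s+(\w+ed|been|seen|done|gone|written)\b", re.IGNORECASE
)

def highlight(text: str) -> str:
    """Wrap this week's target structure in markers the reader can render."""
    return PRESENT_PERFECT.sub(lambda match: f"**{match.group(0)}**", text)

print(highlight("She has visited Rome. They have written many letters."))
# She **has visited** Rome. They **have written** many letters.
```

Rereading last week’s passage with a different pattern loaded would highlight different parts – exactly the effect described above.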

There are many more possible uses of texts available in e-book rather than print form. I have just scratched the surface of what is possible.

Nor have I discussed technical details that would make each of these items really tick. Perhaps I will in a future post, if there is enough interest (vote in the comments).

For those curious now, I will just mention two concepts: Computational Linguistics and CALL/ATALL. Follow the directions these fields of study point in, and your thinking adventures will be more wondrous than those of Alice.

Posted in Computational Linguistics, e-books, Language acquisition, Publishing | 1 Comment »