Computational Linguistics – News update for Oct 9, 2006
Posted by Alexandre Rafalovitch on October 9, 2006
A couple of interesting things happened recently in Computational Linguistics and related fields that I thought were worth linking to:
- ACM Queue had an interview with Mike Cohen of Google (previously of Nuance Communications) discussing recent advances and changes in speech recognition technology.
- Pluggd, with its hotly discussed demo of HearHere, uses speech recognition and some sort of topic clustering to show a time heatmap of your search keyword inside the podcast. The idea is that the heatmap allows you to skip straight to the discussion of the topic you are interested in and ignore the parts irrelevant to you (and the adverts). They have a short presentation about the product in the DEMOfall archives. Warning: sometimes it takes a couple of tries to get the DEMO video to play (depending on system load).
- PodZinger, which already used speech recognition to search within podcasts for search terms, has just added an advertising platform that classifies ads by the podcast content and the search term.
- Netflix has created a challenge in which it provides recommendation data so that other people can try to develop an algorithm better than the one from Netflix’s own data mining team. With a grand prize of a million dollars ($1,000,000), there are already a lot of competitors. While the dataset provided only has movie titles, and therefore not enough for any text/description analysis, it is still a huge dataset to try various graph and neural network methods on. Most people suggest mashing it up with IMDB or some other movie information database, but that obviously requires additional data matching work.
- ClearForest, on the other hand, is offering only 2000 dollars ($2,000) in their competition, and you have to bring your own data, but at least they provide an API that does named entity recognition. That beats having to load up GATE every time and, who knows, maybe somebody can create another Gutenkarte-style mashup.
- And to finish on a funny note, maybe you would like this joke generated by STANDUP (popularised writeup): What do you get when you cross a car with a sandwich? A traffic jam.
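The time-heatmap idea from the Pluggd item can be sketched roughly as follows. This is not how HearHere actually works (its internals are not described here); it is only a minimal illustration that assumes a speech recognizer has already produced a transcript of (seconds, word) pairs. The function name `keyword_heatmap` and its parameters are made up for this example.

```python
import math

def keyword_heatmap(transcript, keyword, duration, bucket_seconds=60):
    """Count keyword hits per time bucket.

    transcript: iterable of (seconds_from_start, word) pairs (assumed input,
                e.g. from a speech recognizer).
    duration:   total length of the recording in seconds.
    Returns a list with one count per bucket, in time order.
    """
    n_buckets = max(1, math.ceil(duration / bucket_seconds))
    counts = [0] * n_buckets
    target = keyword.lower()
    for seconds, word in transcript:
        if word.lower() == target:
            # Clamp so a word timestamped at the very end stays in range.
            index = min(int(seconds // bucket_seconds), n_buckets - 1)
            counts[index] += 1
    return counts

# Hypothetical three-minute podcast transcript.
transcript = [(5, "welcome"), (70, "Netflix"), (75, "prize"),
              (130, "netflix"), (131, "netflix")]
heat = keyword_heatmap(transcript, "netflix", duration=180)
hottest_start = heat.index(max(heat)) * 60  # where to skip to, in seconds
```

A player could then colour each minute of the timeline by its count and let the listener jump to `hottest_start`.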
Francesco Sclano said
Hi everybody,
TermExtractor, my master's thesis, is online at http://lcl2.di.uniroma1.it !!!
TermExtractor is a software package for the automatic
building, validation and maintenance of glossaries in
English.
TermExtractor extracts the terminology consensually
referred to in a specific application domain. The package
takes as input a corpus of domain documents, parses
the documents, and extracts a list of “syntactically
plausible” terms (e.g. compounds, adjective-noun pairs,
etc.). Document parsing assigns greater importance
to terms with special text layout (title, bold, italic,
underlined, etc.). Two entropy-based measures, called
Domain Relevance and Domain Consensus, are then used.
Domain Consensus is used to select only the terms
which are consensually referred to throughout the corpus
documents. Domain Relevance is used to select only the
terms which are relevant to the domain of interest; it
is computed with reference to a set of
contrastive terminologies from different domains.
Finally, extracted terms are further filtered using
Lexical Cohesion, which measures the degree of
association of all the words in a terminological
string. Accepted file formats are: txt, pdf, ps, dvi,
tex, doc, rtf, ppt, xls, xml, html/htm, chm, wpd and
also zip archives.
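A rough sketch of the two measures described above, for readers who want something concrete. The exact formulas TermExtractor uses are not given in this comment, so everything below is an assumption: Domain Consensus is taken to be the entropy of a term's distribution across the corpus documents, and Domain Relevance is taken to be the term's relative frequency in the target domain divided by its highest relative frequency in any domain. Documents are assumed to be already tokenized into lists of terms, and all function names are illustrative.

```python
import math

def domain_consensus(term, documents):
    """Entropy (in bits) of the term's distribution across documents.

    Higher values mean the term is used evenly throughout the corpus;
    0.0 means it appears in at most one document.
    """
    counts = [doc.count(term) for doc in documents]
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy

def relative_frequency(term, documents):
    """Share of all tokens in the documents that are the given term."""
    tokens = [token for doc in documents for token in doc]
    return tokens.count(term) / len(tokens) if tokens else 0.0

def domain_relevance(term, target_docs, contrast_corpora):
    """Target-domain frequency relative to the best frequency in any domain."""
    target = relative_frequency(term, target_docs)
    best = max([target] + [relative_frequency(term, c) for c in contrast_corpora])
    return target / best if best > 0 else 0.0

# Toy data: one target domain and one contrastive domain.
docs = [["neural", "network", "model"], ["neural", "parser"],
        ["grammar", "neural"], ["grammar"]]
contrast = [[["neural", "recipe", "oven", "salt"]]]
consensus = domain_consensus("neural", docs)
relevance = domain_relevance("neural", docs, contrast)
```

A filter along these lines would keep only terms whose consensus and relevance both exceed some thresholds; the thresholds themselves would have to be tuned per corpus.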
Alexandre Rafalovitch said
Francesco,
This is not really a good place to announce a new piece of software or a service. Nobody will find it.
You will have much better luck at any of the repositories listed on the ACL Wiki.
Computational Linguistics - News update for Nov 15, 2006 « Always Learning! said
[...] Lots of new sightings of CL/NLP technologies since the last update: [...]