HackerMoJo.com


Ceci n'est pas une blog
by Glenn Franxman, Django Developer / Stunt Programmer.

OpenCalais beats me to the punch

posted: 2008-02-11 04:57:23

I've been playing with various entity and information extraction frameworks for the past couple of weeks with the goal of creating a web service for extracting the major topics from news articles. So far my work, such as it is, has shown promise, but it is not as robust or reliable as I had hoped.

I just noticed that Reuters has apparently been working along the same lines and has opened its work to the public. On the one hand, I feel beaten. On the other, their service does not appear to be much better than my own, although they'll have a larger set of people re-training their decision trees than I'll ever have.

I've been using posts from my own blog as my testing ground, so I thought I'd throw my last post through and see what it churns out:

Relations: PersonProfessional
Organization: Entrepreneur Fund
IndustryTerm: rubber
Person: Len Gilbert, Tom Berreca, Darline Jean, Glenn Franxman, Matthew de Gannon, Sam Parker, Clark, Beth Higbee, Eleanor Cippel, Martha Stewart
City: Naples

Well, I don't even try to extract relations, so that's pretty cool.

IndustryTerm is pretty puzzling, although I understand why they were fooled by it.

My organization extraction appears to handle this better, insofar as I am extracting things like The Weather Channel, etc. from that post.

My City extraction is better insofar as it fetches the state along with the city name.

Person extraction is interesting. They got Martha Stewart, whereas I keep tripping over it and call her Martha Stewart Living, and I pull out Omnimedia as an Organization.

For comparison's sake, here's the result of my own project:
13LOCATION:Summit
23LOCATION:Naples,Fl
40ORGANIZATION:Ritz-CarltonGolfResort
101PERSON:Matthew
103PERSON:Gannon
105ORGANIZATION:SVP
107ORGANIZATION:TheWeatherChannelInteractive
112PERSON:LenGilbert
117PERSON:Hill
119PERSON:DarlineJean
131ORGANIZATION:Products
133ORGANIZATION:TheWeatherChannelInteractive
153ORGANIZATION:Weather
171ORGANIZATION:TomBerreca
175ORGANIZATION:SVPDigital&Emerging
179PERSON:Media
181PERSON:MarthaStewartLiving
184ORGANIZATION:Omnimedia
186PERSON:SamParker
189ORGANIZATION:CNET
194PERSON:Tom
201ORGANIZATION:Scripps
204PERSON:Martha
257PERSON:EleanorCippel
260ORGANIZATION:Scripps
268PERSON:BethHigbee
271ORGANIZATION:ScrippsNetworksInteractive
367PERSON:Eleanor

Fugly formatting aside, I'm being way more aggressive in my extraction. I think I'm going to need better confidence heuristics. Also, I'm a lot slower than OpenCalais.
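To line the two result sets up side by side, a rough sketch like the following could group the output above by type and diff the names. It is only an illustration, not the code behind this project: the line shape (leading integer, type, colon, text), the regex, and the helper names `group_entities` and `squash` are all assumptions, and the sample is a handful of lines copied from the listing.

```python
import re
from collections import defaultdict

# Assumed line shape: <integer offset><TYPE>:<text>, e.g. "23LOCATION:Naples,Fl".
LINE_RE = re.compile(r"^(\d+)([A-Z]+):(.+)$")


def group_entities(raw):
    """Group entity strings by type, first-seen order, duplicates dropped."""
    grouped = defaultdict(list)
    for line in raw.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank or unrecognised lines
        _offset, kind, text = match.groups()
        if text not in grouped[kind]:
            grouped[kind].append(text)
    return dict(grouped)


def squash(name):
    """Drop whitespace so "Len Gilbert" compares equal to "LenGilbert"."""
    return "".join(name.split())


# A few lines copied from the listing above.
sample = """\
112PERSON:LenGilbert
119PERSON:DarlineJean
181PERSON:MarthaStewartLiving
184ORGANIZATION:Omnimedia
186PERSON:SamParker
"""

mine = set(group_entities(sample)["PERSON"])
calais = {squash(n) for n in ["Len Gilbert", "Darline Jean", "Martha Stewart", "Sam Parker"]}

print(sorted(mine & calais))  # found by both
print(sorted(mine - calais))  # only in the local output
print(sorted(calais - mine))  # only in the OpenCalais output
```

On this sample the only disagreement is the Martha Stewart entry: the local output has MarthaStewartLiving where OpenCalais has Martha Stewart.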

There are all sorts of interesting differences, many of which probably come from the body of text they've got to train against. I'm still fooled by names like Matthew de Gannon. The most interesting bit, though, is how the two systems treated the phrase "Martha Stewart Living Omnimedia".
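One crude way to approach a phrase like that is to merge entities whose spans touch. The sketch below is an assumption-heavy illustration, not what either system does: it assumes the leading numbers in the output are token positions, that each entity's token count is known, and that a merged span should be labelled ORGANIZATION. The `Span` class and `merge_adjacent` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Span:
    start: int   # assumed: index of the first token
    length: int  # assumed: number of tokens covered
    kind: str
    text: str

    @property
    def end(self):
        return self.start + self.length


def merge_adjacent(spans, merged_kind="ORGANIZATION"):
    """Merge spans that touch with no gap, labelling the result merged_kind."""
    out = []
    for span in sorted(spans, key=lambda s: s.start):
        if out and out[-1].end == span.start:
            prev = out.pop()
            span = Span(prev.start, prev.length + span.length,
                        merged_kind, f"{prev.text} {span.text}")
        out.append(span)
    return out


# Offsets taken from the listing; the token lengths are assumptions.
spans = [
    Span(181, 3, "PERSON", "Martha Stewart Living"),
    Span(184, 1, "ORGANIZATION", "Omnimedia"),
    Span(186, 2, "PERSON", "Sam Parker"),
]
for s in merge_adjacent(spans):
    print(s.start, s.kind, s.text)
```

Merging blindly like this would also glue together unrelated neighbours, such as two people listed back to back, so a real version would need some confidence check before joining.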

Oh well. They have a bounty for anyone who can create a WordPress plugin for this. I don't use WordPress, or I would have done it tonight. As it stands, I might plug it into this site or create a JavaScript badge to make it usable anywhere.


Comments

#1

Michael commented, on May 12, 2008 at 12:23 p.m.:

OpenCalais will be updating the site this week. Check it out at the end of the week.

#2

Abhishek commented, on July 21, 2008 at 3:25 p.m.:

Have you tried the new version now available on our website? I guess you should give it a shot. Various improvements have been made to it.
Abhishek

#3

Klezio commented, on March 2, 2009 at 6:49 a.m.:

OpenCalais is great; it is used in: http://www.klezio.com
News is automatically classified and news metadata extracted; contextual information is fetched from apps such as Wikipedia, Flickr, Twitter, or Delicious.
Hope it'll serve.
Regards,
Klezio Team

Copyright © 2003,2004,2005,2006,2007,2008 GFranxman. All Rights Reserved


hosting: slicehost.com. powered by: django. written in: python. controlled by: bzr. monsters by: monsterID.