posted: 2008-02-11 04:57:23
I've been playing with various entity and information extraction frameworks for the past couple of weeks with the goal of creating a web service for extracting the major topics from news articles. So far, my work, such as it is, has shown promise, but is not as robust or reliable as I would have hoped.
I just noticed that Reuters has apparently been working along the same lines and has opened their work to the public as OpenCalais. On the one hand, I feel beaten. On the other, their service does not appear to be much better than my own, although they'll have a larger set of people re-training their decision trees than I'll ever have.
I've been using posts from my own blog as my testing ground, so I thought I'd throw my last post through and see what it churns out:
Relations: PersonProfessional
Organization: Entrepreneur Fund
IndustryTerm: rubber
Person: Len Gilbert, Tom Berreca, Darline Jean, Glenn Franxman, Matthew de Gannon, Sam Parker, Clark, Beth Higbee, Eleanor Cippel, Martha Stewart
City: Naples
Well, I don't even try to extract relations, so that's pretty cool.
IndustryTerm is pretty puzzling, although I understand why they were fooled by it.
My organization extraction appears to handle this better, insofar as I am extracting things like The Weather Channel, etc. from that post.
My City extraction is better insofar as it fetches the state along with the city name.
Person extraction is interesting. They got Martha Stewart, whereas I keep tripping over it: I call her Martha Stewart Living, and I pull out Omnimedia as an Organization.
For comparison's sake, here's the result of my own project:
13LOCATION:Summit
23LOCATION:Naples,Fl
40ORGANIZATION:Ritz-CarltonGolfResort
101PERSON:Matthew
103PERSON:Gannon
105ORGANIZATION:SVP
107ORGANIZATION:TheWeatherChannelInteractive
112PERSON:LenGilbert
117PERSON:Hill
119PERSON:DarlineJean
131ORGANIZATION:Products
133ORGANIZATION:TheWeatherChannelInteractive
153ORGANIZATION:Weather
171ORGANIZATION:TomBerreca
175ORGANIZATION:SVPDigital&Emerging
179PERSON:Media
181PERSON:MarthaStewartLiving
184ORGANIZATION:Omnimedia
186PERSON:SamParker
189ORGANIZATION:CNET
194PERSON:Tom
201ORGANIZATION:Scripps
204PERSON:Martha
257PERSON:EleanorCippel
260ORGANIZATION:Scripps
268PERSON:BethHigbee
271ORGANIZATION:ScrippsNetworksInteractive
367PERSON:Eleanor
Fugly formatting aside, I'm being way more aggressive in my extraction. I think I'm going to need better confidence heuristics. Also, I'm a lot slower than OpenCalais.
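As a rough idea of what a better confidence heuristic might look like, here is a minimal sketch in Python. It is not the code behind the output above. It assumes each line is a leading number, a label, and the text, and the rule it applies is purely hypothetical: a short mention such as Martha or Eleanor is folded into the one longer mention of the same type that starts with it, and only entities seen at least twice are kept. The names `parse_mentions`, `fold_fragments`, and `confident` are made up for illustration.

```python
import re
from collections import Counter

# One line of the dump: leading number, label, then the extracted text.
LINE_RE = re.compile(r"^(\d+)(LOCATION|ORGANIZATION|PERSON):(.+)$")


def parse_mentions(lines):
    """Turn lines like '257PERSON:EleanorCippel' into (offset, label, text)."""
    mentions = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            offset, label, text = match.groups()
            mentions.append((int(offset), label, text))
    return mentions


def fold_fragments(mentions):
    """Hypothetical rule: credit a short mention to the single longer
    mention of the same label that starts with it, if there is exactly one."""
    counts = Counter((label, text) for _, label, text in mentions)
    folded = Counter()
    for (label, text), n in counts.items():
        longer = [t for (l, t) in counts
                  if l == label and t != text and t.startswith(text)]
        target = longer[0] if len(longer) == 1 else text
        folded[(label, target)] += n
    return folded


def confident(folded, min_count=2):
    """Keep only entities backed by at least min_count mentions."""
    return sorted(key for key, n in folded.items() if n >= min_count)


# A few lines copied from the output above.
sample = [
    "181PERSON:MarthaStewartLiving",
    "194PERSON:Tom",
    "201ORGANIZATION:Scripps",
    "204PERSON:Martha",
    "257PERSON:EleanorCippel",
    "260ORGANIZATION:Scripps",
    "271ORGANIZATION:ScrippsNetworksInteractive",
    "367PERSON:Eleanor",
]
folded = fold_fragments(parse_mentions(sample))
print(confident(folded))
```

On this sample the lone Tom falls below the threshold, while the repeated Scripps mentions reinforce the longer organization name. A repeat count is a crude proxy for confidence, so treat the threshold as a knob to tune rather than a recommendation.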
There are all sorts of interesting differences; many probably come from the body of text they've got to train against. I'm still fooled by names like Matthew de Gannon. The most interesting bit, though, is how the two systems treated the phrase "Martha Stewart Living Omnimedia".
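One way to handle a phrase like that is to stitch back-to-back spans together after extraction. The sketch below is only an illustration under stated assumptions: that the leading numbers in my output are token positions, that each mention carries its list of tokens, and that when two touching spans disagree on type, ORGANIZATION wins. The function name `merge_adjacent` and the tuple layout are hypothetical.

```python
def merge_adjacent(mentions):
    """Merge mentions whose token spans touch.

    mentions: iterable of (start_token, label, tokens), an assumed layout.
    Touching spans are joined when their labels agree, or when one of them
    is an ORGANIZATION (the merged span is then labeled ORGANIZATION).
    """
    merged = []
    for start, label, tokens in sorted(mentions, key=lambda m: m[0]):
        if merged:
            prev_start, prev_label, prev_tokens = merged[-1]
            touching = prev_start + len(prev_tokens) == start
            if touching and prev_label == label:
                merged[-1] = (prev_start, label, prev_tokens + list(tokens))
                continue
            if touching and "ORGANIZATION" in (prev_label, label):
                merged[-1] = (prev_start, "ORGANIZATION",
                              prev_tokens + list(tokens))
                continue
        merged.append((start, label, list(tokens)))
    return merged


# Spans taken from the output above, with the tokens split by hand.
spans = [
    (181, "PERSON", ["Martha", "Stewart", "Living"]),
    (184, "ORGANIZATION", ["Omnimedia"]),
    (186, "PERSON", ["Sam", "Parker"]),
]
for start, label, tokens in merge_adjacent(spans):
    print(start, label, " ".join(tokens))
```

With these three spans, the first two are joined into a single organization and Sam Parker is left alone because his span does not touch the previous one. The obvious cost is that a person named right before a company would get swallowed too, so this rule would need a guard in practice.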
Oh well. They have a bounty for anyone who can create a WordPress plugin for this. I don't use WordPress, or I would have done it tonight. As it stands, I might plug it into this site or create a JavaScript badge to make it usable anywhere.
Michael commented, on May 12, 2008 at 12:23 p.m.:
OpenCalais will be updating the site this week. Check it out at the end of the week.