Best Practice for training data? #19

Shotgunosine · 2015-09-15T15:26:45Z

I'm trying to improve performance of the parser on a fairly messy list containing individuals, households, and corporations. For individuals and households the parser works great.
For corporations I see lots of listings like:
Acme LLC, A Delaware Limited Liability Company

Currently the tagging for that will be:

| ACME      | CorporationName             |
| LLC.,     | CorporationLegalType        |
| A         | CorporationName             |
| DELAWARE  | CorporationName             |
| LIMITED   | CorporationName             |
| LIABILITY | CorporationName             |
| COMPANY   | CorporationNameOrganization |

I think ideally the result would be something like:

| ACME      | CorporationName             |
| LLC.,     | CorporationLegalType        |
| A         | Article                     |
| DELAWARE  | Location                    |
| LIMITED   | CorporationNameOrganization |
| LIABILITY | CorporationNameOrganization |
| COMPANY   | CorporationNameOrganization |

In addition to adding "Article" and "Location" labels, I was thinking I would add edit distance to a state name as a feature.
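To make the edit-distance idea concrete, here is a rough sketch of what such a feature might look like. It is not the parser's actual feature code: `levenshtein`, `state_distance_feature`, the feature keys, and the short `STATE_NAMES` list are all hypothetical names chosen for illustration, and the threshold of 1 is an arbitrary assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


# Illustrative subset only; a real feature would list every state.
STATE_NAMES = ("delaware", "nevada", "texas", "california", "new york")


def state_distance_feature(token: str) -> dict:
    """Hypothetical token feature: distance to the closest state name."""
    cleaned = token.strip(".,").lower()
    distance = min(levenshtein(cleaned, state) for state in STATE_NAMES)
    return {
        "state.edit_distance": distance,
        "state.near_match": distance <= 1,  # assumed threshold
    }
```

A misspelling such as "Deleware" would still be flagged as a near match here, while a token like "ACME" would not.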

My question is about how much training data I should use. Is it purely a situation where more examples will be better? Or should I add a few core examples and then augment those with problem cases as they come up?


fgregg · 2015-10-26T18:02:40Z

That's a pretty strange one.

So right now we are using CorporationNameOrganization for things like

Brandeis University
Concerned Citizens Council
ACME Media Group

This is not really what's going on with "A Delaware Limited Liability Company". I would say that it is not really part of the name at all.
