Best Practice for training data? #19

Open
Shotgunosine opened this issue Sep 15, 2015 · 1 comment

Comments

@Shotgunosine

I'm trying to improve the parser's performance on a fairly messy list containing individuals, households, and corporations. For individuals and households, the parser works great.
For corporations, I see lots of listings like:
Acme LLC, A Delaware Limited Liability Company

Currently the tagging for that will be:

| ACME      | CorporationName             |
| LLC.,     | CorporationLegalType        |
| A         | CorporationName             |
| DELAWARE  | CorporationName             |
| LIMITED   | CorporationName             |
| LIABILITY | CorporationName             |
| COMPANY   | CorporationNameOrganization |

I think ideally the result would be something like:

| ACME      | CorporationName             |
| LLC.,     | CorporationLegalType        |
| A         | Article                     |
| DELAWARE  | Location                    |
| LIMITED   | CorporationNameOrganization |
| LIABILITY | CorporationNameOrganization |
| COMPANY   | CorporationNameOrganization |

In addition to adding "Article" and "Location" labels, I was thinking I would add a feature for the edit distance between a token and the nearest state name.
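To make the edit-distance idea concrete, here is a rough sketch of what such a feature could look like. Everything in it is an assumption for illustration: the function names (`levenshtein`, `state_distance`, `state_features`), the feature keys, and the threshold are hypothetical and are not part of the parser's existing feature set. It only compares single tokens against full state names, so multi-word states like "New York" would need extra handling.

```python
# Hypothetical sketch: edit distance from a token to the nearest US state name.
# Names and feature keys below are illustrative, not the parser's real API.

US_STATES = (
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
    "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
    "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana",
    "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota",
    "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
    "New Hampshire", "New Jersey", "New Mexico", "New York",
    "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
    "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
    "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington",
    "West Virginia", "Wisconsin", "Wyoming",
)
_STATES_LOWER = tuple(name.lower() for name in US_STATES)


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def state_distance(token: str) -> int:
    """Smallest edit distance between the cleaned token and any state name."""
    cleaned = token.strip(".,;:").lower()
    return min(levenshtein(cleaned, state) for state in _STATES_LOWER)


def state_features(token: str, threshold: int = 1) -> dict:
    """Feature dict for one token; the threshold of 1 is an arbitrary guess."""
    distance = state_distance(token)
    return {
        "state_distance": distance,
        "near_state": distance <= threshold,
    }


if __name__ == "__main__":
    for tok in ["ACME", "LLC.,", "A", "DELAWARE", "LIMITED", "LIABILITY", "COMPANY"]:
        print(tok, state_features(tok))
```

A boolean like `near_state` might be easier for the model to use than the raw distance, since it tolerates misspellings without treating every short token as "close" to something. That is a guess, though, and would need to be checked against held-out examples.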

My question is about how much training data I should use. Is it purely a situation where more examples will be better? Or should I add a few core examples and then augment those with problem cases as they come up?

@fgregg
Contributor

fgregg commented Oct 26, 2015

That's a pretty strange one.

So right now we are using CorporationNameOrganization for things like:

  • Brandeis University
  • Concerned Citizens Council
  • ACME Media Group

This is not really what's going on with "A Delaware Limited Liability Company". I would say that it is not really part of the name at all.
