Markov loves those dots! #11

Open
vphantom opened this issue Feb 9, 2016 · 3 comments

Comments

@vphantom
Member

vphantom commented Feb 9, 2016

(Hopefully I'm not speaking out of turn here.) Yesterday in the stream, one thing HoomanBot did quite frequently was use single periods as words, in sequences:

something something . . . . something . . something

I'd bet it's because hoomans themselves produce those when they end a clause or sentence with a fellow hooman's tab-completed nickname: Twitch adds a space after the completion, so the period typed next ends up standing alone. Therefore, it might make sense to filter out or disallow single punctuation characters from becoming "words". (And I mean just punctuation; I can see "=" and such being useful.)
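
To make the suggestion concrete, here is a minimal Java sketch of dropping lone punctuation tokens before they reach the Markov model. It is only an illustration: the class name `TokenFilter`, the whitespace tokenization, and the exact set of characters treated as punctuation are assumptions, not taken from the bot's code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper; not part of the bot's actual code.
final class TokenFilter {
    // Assumed set of sentence punctuation. Symbols such as "=" are
    // deliberately left out so they can still become words.
    private static final String SENTENCE_PUNCTUATION = ".,;:!?";

    // True when the token is exactly one sentence-punctuation character.
    static boolean isLonePunctuation(String token) {
        return token.length() == 1
                && SENTENCE_PUNCTUATION.indexOf(token.charAt(0)) >= 0;
    }

    // Splits a chat message on whitespace and drops lone punctuation tokens.
    static List<String> words(String message) {
        return Arrays.stream(message.trim().split("\\s+"))
                .filter(t -> !t.isEmpty())
                .filter(t -> !isLonePunctuation(t))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(words("something something . . . . something . . something"));
        System.out.println(words("x = 1 ."));
    }
}
```

Punctuation attached to a word (such as "name.") passes through unchanged; only tokens consisting of a single punctuation mark are removed.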

@ryonday
Member

ryonday commented Feb 16, 2016

Good catch.

So my usage of RiTA is very crude currently. I generate a random number of sentences (1-3, off the top of my head). RiTA returns these as an array of Strings, and I append them all together, adding a '.' to the end of each one. I did this blithely without really examining what RiTA was returning.

Anyway, I think what we're seeing here happens for two reasons:

  1. RiTA will generate empty sentences for some reason. I'm not filtering these out.
  2. RiTA appears to end each sentence with whatever punctuation it deems appropriate for what it generated, so my own periods are completely superfluous.

So what I'll do for this is:

  1. Switch to the Java Stream API to filter out generated sentences without content (blank or null).
  2. Stop appending a '.' to sentences.
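
The two steps above could look roughly like the following. This is a sketch under assumptions: the class and method names are hypothetical, the input is taken to be the array of Strings described earlier, and joining with a single space is a guess at the desired output format.

```java
import java.util.Arrays;
import java.util.Objects;
import java.util.stream.Collectors;

// Hypothetical helper; the real bot code may be structured differently.
final class SentenceJoiner {
    // Joins generated sentences, skipping null or blank entries.
    // No '.' is appended: each sentence keeps the punctuation it came with.
    static String join(String[] sentences) {
        return Arrays.stream(sentences)
                .filter(Objects::nonNull)
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        // Stand-in for generator output, including empty and null entries.
        String[] generated = {"Hello there.", "", null, "  ", "How are you?"};
        System.out.println(join(generated)); // prints "Hello there. How are you?"
    }
}
```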

@ryonday
Member

ryonday commented Feb 16, 2016

Additionally, I will attempt some simple sanitization of the RiTA input to try to keep things sane. The better the input, the better the output. Do you have any examples of what you mean by "Twitch adds a space when we do so", or other examples of what you're talking about here? (So I can write unit tests.)

@vphantom
Member Author

That was the most obvious case that came to mind. Being pedantic, I backspace after an auto-complete to add my period or comma straight after the name, but most people probably don't. :)

If I could get my hands on a chat log I could check how often it actually happened, although I'm starting to suspect that all the superfluous periods we saw from the bot may have been the addition you mention above.

If you're already massaging incoming text from the chat, it could be prudent to strip whitespace between word characters and postfix punctuation early on. I'm rusty on PCREs, but something like this, assuming plain-text input:

s/([a-zA-Z0-9_-])\s+([,.;:'"!?)])/\1\2/g
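
Since the bot is written in Java, the same substitution might be expressed with `java.util.regex` as below. This is a sketch, not the project's code: the class name `PunctuationTidy` and the method `tidy` are hypothetical, and the pattern is copied from the substitution above without changes.

```java
import java.util.regex.Pattern;

// Hypothetical helper; names are illustrative only.
final class PunctuationTidy {
    // Group 1: a letter, digit, underscore or hyphen.
    // Then one or more whitespace characters.
    // Group 2: a single postfix punctuation mark.
    private static final Pattern SPACE_BEFORE_PUNCT =
            Pattern.compile("([a-zA-Z0-9_-])\\s+([,.;:'\"!?)])");

    // Removes the whitespace between a word character and the punctuation after it.
    static String tidy(String text) {
        return SPACE_BEFORE_PUNCT.matcher(text).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        System.out.println(tidy("thanks vphantom ."));      // prints "thanks vphantom."
        System.out.println(tidy("hi bob , how are you ?")); // prints "hi bob, how are you?"
    }
}
```

One caveat worth a unit test: because the quote characters are in the punctuation class, an opening quote after a word is also pulled in (`said "hi"` becomes `said"hi"`), so the class may need narrowing if quoted text is common in chat.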
