Octavian-ai/experiments

Join our Discord >> https://discord.gg/a2Z82Te

Review prediction

Introduction

The aim of this experiment is to investigate the performance of

  1. different NN approaches
  2. different graph representations of the same data

on a simple synthetic prediction task.

The Task

We model personalised recommendations as a system containing people, products and reviews. In our system every product has a style and each person has a style preference. People can review products. The review score is a function Y(...) of the person's style preference and the product's style. We call this function the opinion function, i.e.:

review_score = Y(product_style, person_style_preference)

We will generate data using this model, then use this synthetic data to investigate how effectively various ML approaches learn the behaviour of this system.

If necessary we can change the opinion function Y(...) to increase or decrease the difficulty of the task.

The Synthetic Data

The synthetic data for this task can be varied in several ways:

  1. Change which information is hidden e.g. we could hide product_style, style_preference or both.
  2. Change the representation of the key properties e.g. reviews, styles and preferences could be boolean, categorical, continuous scalars or even multi-dimensional vectors.
  3. Change how the data is represented as a graph e.g. reviews could be nodes in their own right, or they could be edges with properties; product_style could be a property on a product node, or it could be a separate node connected to a product node by a HAS_STYLE relationship (edge).
  4. Add additional meaningless or semi-meaningless information to the training data.
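
To make option 3 concrete, here is a minimal sketch of the same review encoded two ways: once as a node in its own right and once as an edge with a score property. The dictionary layout, the function names and the REVIEWED edge type are assumptions made for illustration; they are not taken from the experiment code.

```python
# Hypothetical in-memory encodings of a single review.
# Field names ("src", "dst", "label") and the REVIEWED type are illustrative only.

def review_as_node(person_id, product_id, score):
    """Encode the review as its own node, linked by WROTE and OF edges."""
    review_id = f"review:{person_id}:{product_id}"
    nodes = [
        {"id": person_id, "label": "PERSON"},
        {"id": review_id, "label": "REVIEW", "score": score},
        {"id": product_id, "label": "PRODUCT"},
    ]
    edges = [
        {"src": person_id, "dst": review_id, "type": "WROTE"},
        {"src": review_id, "dst": product_id, "type": "OF"},
    ]
    return nodes, edges


def review_as_edge(person_id, product_id, score):
    """Encode the review as a single edge carrying the score as a property."""
    nodes = [
        {"id": person_id, "label": "PERSON"},
        {"id": product_id, "label": "PRODUCT"},
    ]
    edges = [
        {"src": person_id, "dst": product_id, "type": "REVIEWED", "score": score},
    ]
    return nodes, edges
```

Both encodings carry the same information; the first uses three nodes and two edges per review, the second two nodes and one edge.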

We will generate different data sets to qualitatively investigate different ML approaches on the same basic system.

Evaluation Tasks

We are interested in four different evaluation tasks depending on whether the person or product is included in the training set or not:

  • new product == unknown at training time i.e. not in training set or validation set
  • new person == unknown at training time i.e. not in training set or validation set
  • existing product == known at training time i.e. present in training set
  • existing person == known at training time i.e. present in training set

In each evaluation task the question is the same: how well can we predict the person's review, given:

  1. new product and new person
  2. existing product and new person
  3. new product and existing person
  4. existing product and existing person
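
The four cases can be told apart mechanically from the training set. The sketch below assumes people and products carry ids and that we hold the sets of ids seen during training; the function name and arguments are hypothetical.

```python
def evaluation_task(person_id, product_id, train_people, train_products):
    """Name the evaluation task that a (person, product) pair falls under.

    train_people and train_products are the sets of ids present in the
    training set (an assumed representation).
    """
    person = "existing person" if person_id in train_people else "new person"
    product = "existing product" if product_id in train_products else "new product"
    return f"{product} and {person}"
```

For example, a review by a known person of an unseen product falls under task 3, "new product and existing person".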

Approach

Although we have a synthetic system for which we can generate more data, we want to get into good habits for working with "real" data. So we will attempt to blind the ML system to the fact that we are working with synthetic data, and we will not rely on our ability to generate more information at will.

It will be the responsibility of the ML part of the system to split the data into test, train and validation sets. However, for each data set that we generate we will keep back a small portion to make up a "golden" test set which is only to be used at the very end of our investigation. This provides a final test of the ML predictor on data against which we have not had the opportunity to optimise the meta-parameters.

Because of the four different evaluation tasks it will be necessary for us to keep back four different golden test sets, each large enough to test the system regardless of the test/training split. We will keep the following volumes of golden test data:

  1. INDEPENDENT: A completely independent data set containing 1000 reviews
  2. NEW_PEOPLE: new people + their reviews of existing products containing approx 2000 reviews
  3. NEW_PRODUCTS: new products + reviews of them by existing people containing approx 2000 reviews
  4. EXISTING: 2000 additional reviews between existing people and products.

The Data Sets

Data Set 1: A simple binary preference system

Products have a binary style and people have a binary preference.

  • All variables will be 'public' in the data set

Product Style

  • product_style will be categorical with two mutually exclusive elements (A and B).
  • The distribution of product styles will be uniform i.e. approx 50% of products will have style A and 50% will have style B.

Style Preference

  • person_style_preference will be categorical with two mutually exclusive elements (likes_A_dislikes_B | likes_B_dislikes_A).
  • The distribution of style preferences will be uniform i.e. approx 50% of people will like style A and 50% will like style B.

Reviews and Opinion Function

  • review_score will be boolean (1 for a positive review and 0 for a negative review)
  • Each person will have made either 1 or 2 reviews. The mean number of reviews-per-person will be approx 1.5 i.e. approx 50% will have made 2 reviews and 50% will have made 1 review.
  • review_score is the dot product of product_style and person_style_preference, normalised to the range 0 to 1.

Note: having people with 0 reviews would be useless since you cannot train or validate/test using them.

Note: fixing the number of reviews-per-person would restrict the graph structure too much and open up the problem to approaches that we aren't interested in right now.
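
The opinion function for this data set can be sketched as follows. The text does not say how the categorical values are turned into vectors, so the encoding here is an assumption: styles are one-hot, preferences are +1 for the liked style and -1 for the disliked one, and the dot product (which is then -1 or +1) is mapped onto 0 or 1.

```python
# Assumed vector encodings; the README does not specify them.
STYLE_VECTORS = {
    "A": (1, 0),
    "B": (0, 1),
}
PREFERENCE_VECTORS = {
    "likes_A_dislikes_B": (1, -1),
    "likes_B_dislikes_A": (-1, 1),
}


def opinion(product_style, person_style_preference):
    """Y(product_style, person_style_preference) for the binary data set."""
    style = STYLE_VECTORS[product_style]
    preference = PREFERENCE_VECTORS[person_style_preference]
    dot = sum(s * p for s, p in zip(style, preference))  # either -1 or +1
    return (dot + 1) // 2  # normalise {-1, +1} to {0, 1}
```

Under this encoding a person gives a positive review exactly when the product's style is the one they like.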

Entity Ratios and Data Set Size

I basically made these up. Intuitively the reviews-per-product and reviews-per-person parameters affect how much we can infer about people/product hidden variables. I like the idea of those figures being very different so we can see how systems cope with that distinction.

  • people:products = 50:1
  • people:reviews = 1:1.5
  • reviews:products = 75:1

Data set size: 12000 reviews / 160 products / 8000 people

N.B. Because we assign the reviews randomly, some products may have no reviews, but with about 75 reviews per product on average this is very unlikely.
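
As a quick consistency check, the data set size follows from the number of people and the two ratios. The last line is only a rough estimate of how many products end up with no reviews: it assumes every review picks its product independently and uniformly, which ignores the without-replacement rule for a single person's reviews.

```python
n_people = 8000
people_per_product = 50    # people:products = 50:1
reviews_per_person = 1.5   # people:reviews = 1:1.5

n_products = n_people // people_per_product
n_reviews = round(n_people * reviews_per_person)
reviews_per_product = n_reviews / n_products

# Rough estimate (independence assumption): expected number of unreviewed products.
expected_unreviewed = n_products * (1 - 1 / n_products) ** n_reviews

print(n_products, n_reviews, reviews_per_product)  # 160 12000 75.0
```

The expected number of unreviewed products comes out many orders of magnitude below one.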

Graph Schema

PERSON(id: , style_preference: A|B, is_golden: True|False) -- WROTE(is_golden: True|False) --> REVIEW(id: , score: 1|0, is_golden: True|False) -- OF(is_golden: True|False) --> PRODUCT(id: , style: A|B, is_golden: True|False)
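
For readers who prefer code to schema notation, the same schema can be mirrored as plain Python records. This is only an illustrative sketch: the class names, the integer id type and the person_id / review_id / product_id endpoint fields are assumptions, since the schema leaves the id type unspecified.

```python
from dataclasses import dataclass


@dataclass
class Person:
    id: int                  # id type is an assumption
    style_preference: str    # "A" or "B"
    is_golden: bool = False


@dataclass
class Review:
    id: int
    score: int               # 1 or 0
    is_golden: bool = False


@dataclass
class Product:
    id: int
    style: str               # "A" or "B"
    is_golden: bool = False


@dataclass
class Wrote:                 # PERSON -- WROTE --> REVIEW
    person_id: int
    review_id: int
    is_golden: bool = False


@dataclass
class Of:                    # REVIEW -- OF --> PRODUCT
    review_id: int
    product_id: int
    is_golden: bool = False
```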

Data generation algorithm

  1. Instantiate all products for the public data set and write them to Neo, keeping an array of the ids.
  2. Iteratively instantiate people, deciding probabilistically how many reviews each person will have made.
  3. For each review that the person has to make, randomly choose a product to review (without replacement).
  4. Calculate the review score and submit the person and their reviews to Neo.
  5. Read the data back out of Neo and validate the entity ratios.
  6. Create the golden test sets:
  • NEW_PEOPLE: create 2000/reviews_per_person new people and their reviews of randomly selected (with replacement) existing products.
  • NEW_PRODUCTS: create 2000/reviews_per_product new products and have randomly selected (with replacement) people review them.
  • EXISTING: randomly pick 2000 people (with replacement) and have each of them review a randomly selected (with replacement) product.
  • INDEPENDENT: easy, but best left until last to avoid confusion; just repeat the basic data generation from scratch.
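
Steps 1 to 5 can be sketched in memory as below. This is not the experiment's implementation: it keeps everything in Python lists instead of writing to Neo, the generate function and its record fields are hypothetical, and the review score uses the shortcut that a review is positive exactly when the product's style is the one the person likes (our reading of the binary opinion function).

```python
import random


def generate(n_products=160, n_people=8000, seed=0):
    """In-memory sketch of the public data set generation (no database)."""
    rng = random.Random(seed)

    # Step 1: instantiate all products.
    products = [{"id": i, "style": rng.choice("AB")} for i in range(n_products)]

    people, reviews = [], []
    for person_id in range(n_people):
        # Step 2: instantiate a person and decide how many reviews they made.
        preference = rng.choice(["likes_A_dislikes_B", "likes_B_dislikes_A"])
        people.append({"id": person_id, "style_preference": preference})
        n_reviews = rng.choice([1, 2])

        # Step 3: choose products without replacement for this person.
        for product in rng.sample(products, n_reviews):
            # Step 4: score is 1 when the product has the liked style.
            liked_style = preference[len("likes_")]  # "A" or "B"
            score = 1 if product["style"] == liked_style else 0
            reviews.append(
                {"person_id": person_id, "product_id": product["id"], "score": score}
            )
    return products, people, reviews


# Step 5: validate the entity ratios.
products, people, reviews = generate()
print(len(people) / len(products))   # people:products, expect 50.0
print(len(reviews) / len(people))    # reviews per person, expect approx 1.5
print(len(reviews) / len(products))  # reviews per product, expect approx 75
```

For step 6, the stated formulas give roughly 2000/1.5 ≈ 1333 new people for NEW_PEOPLE and 2000/75 ≈ 27 new products for NEW_PRODUCTS.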
