Update Readme
hmetaxa committed Oct 6, 2017
1 parent ceb54f7 commit 83a2508
Showing 1 changed file with 14 additions and 9 deletions.
23 changes: 14 additions & 9 deletions README.md
@@ -8,24 +8,29 @@ This way, MV-HDP combines interaction and reinforcement addressing the following
1) multi-view modeling leveraging statistical strength among different views,
2) estimation of, and adaptation to, the extent of correlation between the different views, which is inferred automatically during inference, and
3) scalable inference on massive, real-world datasets. The latter is achieved through a parallel Gibbs sampling scheme that utilizes an efficient F+Tree data structure.

We consider that estimating the right number of topics is not our primary goal especially in collections of that size.
So, we have implemented a truncated version of the proposed model where the goal is to better estimate priors and model parameters given a maximum number of topics.

Although initially we extended MALLET’s efficient parallel topic model with Sparse LDA sampling, we ended up implementing a very different parallel implementation based on F+Trees
-that: 1) it is more readable and extensible,
-2) it is usually faster in real world big datasets especially in multi-view settings
-3) shares model related data structures across threads (contrary to MALLET where each thread retains its own model copy).
+that:
+- it is more readable and extensible,
+- it is usually faster in real world big datasets especially in multi-view settings
+- shares model related data structures across threads (contrary to MALLET where each thread retains its own model copy)
We update the model using background threads based on a lock-free, queue-based scheme that minimizes staleness.
Our implementation can scale to massive datasets (over one million documents with meta-data and links) on a single computer.

Related classes in package cc.mallet.topics:
-FastQMVWVParallelTopicModel: Main class
-FastQMVWVUpdaterRunnable: Model updating
-FastQMVWVWorkerRunnable: Gibbs Sampling
-FastQMVWVTopicModelDiagnostics: Related diagnostics
-PTMExperiment: Example of how we can load data from SQLite, run MV-Topic Models, calc similarities etc
+- FastQMVWVParallelTopicModel: Main class
+- FastQMVWVUpdaterRunnable: Model updating
+- FastQMVWVWorkerRunnable: Gibbs Sampling
+- FastQMVWVTopicModelDiagnostics: Related diagnostics
+- PTMFlow, PTMExperiment: Example of how we can load data from Postgres or SQLite DBs, run MV-Topic Models, calc similarities etc

Example results (multi view topics on Full ACM & OpenAccess PubMed corpora):

https://1drv.ms/f/s!Aul4avjcWIHpg-Ara0PZzqHeOkyGIw

### Running from Command line
-java -Xms2G -Xmx28G -cp "MVTopicModel-1.0-SNAPSHOT.jar;lib/*" org.madgik.MVTopicModel.PTMFlow
+Topic modeling: java -Xms2G -Xmx28G -cp "MVTopicModel-1.0-SNAPSHOT.jar;lib/*" org.madgik.MVTopicModel.PTMFlow
+Semantic Annotation: java -Xms2G -Xmx28G -cp "MVTopicModel-1.0-SNAPSHOT.jar;lib/*" org.madgik.dbpediaspotlightclient.DBpediaAnnotator
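
For readers unfamiliar with the F+Tree data structure that the README text in this diff refers to, here is a minimal Java sketch of the general idea: a complete binary tree whose leaves hold unnormalized topic weights and whose internal nodes hold the sums of their children, so that both drawing a topic and updating a single weight take O(log T) steps. This is not code from this repository; the class name `FTreeSketch` and its methods are hypothetical, and the project's actual classes may be organized quite differently.

```java
import java.util.Random;

/** Hypothetical F+Tree sketch: leaves store weights, internal nodes store subtree sums. */
public class FTreeSketch {
    private final double[] tree;  // tree[1] is the root; tree[0] is unused
    private final int leafStart;  // index of the first leaf (a power of two)

    public FTreeSketch(double[] weights) {
        int n = 1;
        while (n < weights.length) {
            n <<= 1;              // pad the leaf count to a power of two
        }
        leafStart = n;
        tree = new double[2 * n]; // padded leaves keep weight 0.0
        System.arraycopy(weights, 0, tree, leafStart, weights.length);
        for (int i = leafStart - 1; i >= 1; i--) {
            tree[i] = tree[2 * i] + tree[2 * i + 1];
        }
    }

    /** Sum of all weights (the normalizing constant). */
    public double total() {
        return tree[1];
    }

    /** Sets the weight of one topic and refreshes the sums on the path to the root. */
    public void update(int topic, double newWeight) {
        int i = leafStart + topic;
        tree[i] = newWeight;
        for (i >>= 1; i >= 1; i >>= 1) {
            tree[i] = tree[2 * i] + tree[2 * i + 1];
        }
    }

    /** Returns the topic whose cumulative weight interval contains u, for 0 <= u < total(). */
    public int sample(double u) {
        int i = 1;
        while (i < leafStart) {
            double left = tree[2 * i];
            if (u < left) {
                i = 2 * i;        // the target lies in the left subtree
            } else {
                u -= left;        // skip the left subtree's mass
                i = 2 * i + 1;
            }
        }
        return i - leafStart;
    }

    public static void main(String[] args) {
        FTreeSketch t = new FTreeSketch(new double[] {1.0, 2.0, 3.0});
        Random rng = new Random(42);
        int topic = t.sample(rng.nextDouble() * t.total());
        System.out.println("sampled topic: " + topic);
        t.update(topic, 0.5);     // e.g. after a count changes during Gibbs sampling
        System.out.println("new total: " + t.total());
    }
}
```

In a Gibbs sampler, a structure like this would typically be rebuilt or incrementally updated as topic counts change, with `sample` called once per token. Floating-point rounding can in principle push `u` slightly past `total()`, so production code usually guards against landing on a zero-weight padded leaf.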
