Adaptations for OpenAIRE, log4Net logging, Semantic Annotations to Postgres
hmetaxa committed Oct 6, 2017
1 parent 831d4b2 commit 5d19fda
Showing 131 changed files with 141,703 additions and 731 deletions.
33 changes: 31 additions & 2 deletions README.md
@@ -1,2 +1,31 @@
# MVTopicModel
Multi View Topic Model
# MVTopicModel: Probabilistic Multi View Topic Modelling engine
Omiros Metaxas

### General Notes
MVTopicModel implements a non-parametric Multi-View Topic Model (MV-HDP) that extends the well-established Hierarchical Dirichlet Process (HDP)
by incorporating a novel Interacting Pólya Urn scheme (IUM) to model per-document topic distributions.
This way, MV-HDP combines interaction and reinforcement to address the following major challenges:
1) multi-view modeling that leverages statistical strength among different views,
2) estimation of, and adaptation to, the extent of correlation between the different views, which is inferred automatically, and
3) scalable inference on massive, real-world datasets. The latter is achieved through a parallel Gibbs sampling scheme that uses an efficient F+Tree data structure.
We do not consider estimating the right number of topics a primary goal, especially in collections of that size.
So, we have implemented a truncated version of the proposed model, where the goal is to better estimate priors and model parameters given a maximum number of topics.
Although we initially extended MALLET’s efficient parallel topic model with Sparse LDA sampling, we ended up with a very different parallel implementation based on F+Trees
that: 1) is more readable and extensible,
2) is usually faster on real-world big datasets, especially in multi-view settings, and
3) shares model-related data structures across threads (contrary to MALLET, where each thread retains its own model copy).
We update the model using background threads, based on a lock-free, queue-based scheme that minimizes staleness.
Our implementation can scale to massive datasets (over one million documents with metadata and links) on a single computer.
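
To illustrate why an F+Tree helps here, the following is a minimal, hypothetical sketch and not the project's actual class: `FTreeSketch` and its method names are assumptions made for this example. Leaves hold unnormalized topic weights and every internal node holds the sum of its children, so both drawing a topic and changing one weight take O(log T) steps for T topics.

```java
import java.util.Random;

/** Illustrative F+Tree: a complete binary tree stored in an array (heap layout). */
public class FTreeSketch {
    private final double[] tree; // tree[1] is the root, i.e. the total mass
    private final int leafBase;  // index of the first leaf (a power of two)

    public FTreeSketch(double[] weights) {
        int n = 1;
        while (n < weights.length) n <<= 1; // pad the leaf level to a power of two
        leafBase = n;
        tree = new double[2 * n];
        System.arraycopy(weights, 0, tree, n, weights.length);
        // Fill internal nodes bottom-up with the sum of their two children.
        for (int i = n - 1; i >= 1; i--) tree[i] = tree[2 * i] + tree[2 * i + 1];
    }

    /** Total unnormalized mass over all topics. */
    public double total() { return tree[1]; }

    /** Sets the weight of one topic and refreshes its ancestors. */
    public void update(int topic, double weight) {
        int i = leafBase + topic;
        tree[i] = weight;
        for (i >>= 1; i >= 1; i >>= 1) tree[i] = tree[2 * i] + tree[2 * i + 1];
    }

    /** Returns the topic whose cumulative-weight interval contains u, for u in [0, total()). */
    public int sample(double u) {
        int i = 1;
        while (i < leafBase) {
            i <<= 1;                // descend to the left child
            if (u >= tree[i]) {     // target lies beyond the left subtree's mass
                u -= tree[i];
                i++;                // move to the right sibling
            }
        }
        return i - leafBase;
    }

    public static void main(String[] args) {
        FTreeSketch t = new FTreeSketch(new double[] {0.5, 1.5, 2.0, 1.0});
        Random rnd = new Random(42);
        int topic = t.sample(rnd.nextDouble() * t.total());
        System.out.println("sampled topic " + topic + ", total mass " + t.total());
        t.update(topic, 0.0); // e.g. after the topic's weight changes
        System.out.println("total mass after update " + t.total());
    }
}
```

In a Gibbs sampler, `update` would be called whenever a topic's count changes, and `sample` once per token; how the real implementation composes the weights is not shown here.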

Related classes in package cc.mallet.topics:
FastQMVWVParallelTopicModel: Main class
FastQMVWVUpdaterRunnable: Model updating
FastQMVWVWorkerRunnable: Gibbs Sampling
FastQMVWVTopicModelDiagnostics: Related diagnostics
PTMExperiment: Example of how to load data from SQLite, run MV-Topic Models, calculate similarities, etc.
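
The split between worker and updater classes can be pictured with the sketch below. It is an assumption-laden illustration, not the code of `FastQMVWVWorkerRunnable` or `FastQMVWVUpdaterRunnable`: the class `QueueUpdateSketch`, the `Delta` type, and the `run` method are hypothetical. Workers never write the shared counts; they enqueue count changes on a `ConcurrentLinkedQueue` (a lock-free queue from the JDK), and a single background thread applies them.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class QueueUpdateSketch {
    /** A count change for one topic, produced by a sampling worker. */
    static final class Delta {
        final int topic;
        final int amount;
        Delta(int topic, int amount) { this.topic = topic; this.amount = amount; }
    }

    /** Runs the workers plus one updater thread and returns the final shared counts. */
    public static int[] run(int numTopics, int numWorkers, int movesPerWorker) {
        final int[] topicCounts = new int[numTopics]; // written only by the updater
        final ConcurrentLinkedQueue<Delta> queue = new ConcurrentLinkedQueue<>();
        final AtomicInteger workersDone = new AtomicInteger();

        Thread updater = new Thread(() -> {
            while (true) {
                // Read the completion flag BEFORE polling, so an empty poll
                // after all workers finished really means the queue is drained.
                boolean finished = workersDone.get() == numWorkers;
                Delta d = queue.poll();
                if (d != null) topicCounts[d.topic] += d.amount;
                else if (finished) break;
                else Thread.yield();
            }
        });
        updater.start();

        Thread[] workers = new Thread[numWorkers];
        for (int i = 0; i < numWorkers; i++) {
            final int topic = i % numTopics; // stand-in for a sampled topic
            workers[i] = new Thread(() -> {
                for (int m = 0; m < movesPerWorker; m++) queue.add(new Delta(topic, 1));
                workersDone.incrementAndGet();
            });
            workers[i].start();
        }
        try {
            for (Thread w : workers) w.join();
            updater.join(); // join makes the updater's writes visible to the caller
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return topicCounts;
    }

    public static void main(String[] args) {
        int[] counts = run(4, 3, 1000);
        for (int k = 0; k < counts.length; k++) System.out.println("topic " + k + ": " + counts[k]);
    }
}
```

Because only one thread mutates `topicCounts`, no locks are needed on the counts themselves; the trade-off is that samplers may briefly read slightly stale values, which is the staleness the README refers to.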

Example results (multi view topics on Full ACM & OpenAccess PubMed corpora):
https://1drv.ms/f/s!Aul4avjcWIHpg-Ara0PZzqHeOkyGIw

### Running from Command line
java -Xms2G -Xmx28G -cp "MVTopicModel-1.0-SNAPSHOT.jar;lib/*" org.madgik.MVTopicModel.PTMFlow
Binary file added Testoutput/MVTopicModel-1.0-SNAPSHOT.jar
Binary file not shown.
Empty file.
6 changes: 6 additions & 0 deletions Testoutput/MVTopicModel-1.0-SNAPSHOT/META-INF/MANIFEST.MF
@@ -0,0 +1,6 @@
Manifest-Version: 1.0
Archiver-Version: Plexus Archiver
Created-By: Apache Maven
Built-By: omiros
Build-Jdk: 1.8.0_91

@@ -0,0 +1,5 @@
#Generated by Maven
#Mon Jul 24 18:19:49 EEST 2017
version=1.0-SNAPSHOT
groupId=org.madgik
artifactId=MVTopicModel
@@ -0,0 +1,83 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.madgik</groupId>
<artifactId>MVTopicModel</artifactId>
<version>1.0-SNAPSHOT</version>
<packaging>jar</packaging>
<dependencies>
<dependency>
<groupId>cc.mallet</groupId>
<artifactId>mallet</artifactId>
<version>2.0.8</version>
</dependency>
<dependency>
<groupId>org.json</groupId>
<artifactId>json</artifactId>
<version>20140107</version>
</dependency>
<dependency>
<groupId>commons-httpclient</groupId>
<artifactId>commons-httpclient</artifactId>
<version>3.1</version>
</dependency>
<dependency>
<groupId>net.sf.trove4j</groupId>
<artifactId>trove4j</artifactId>
<version>3.0.3</version>
</dependency>
<dependency>
<groupId>com.sree.textbytes</groupId>
<artifactId>jtopia</artifactId>
<version>0.0.3</version>
</dependency>
<dependency>
<groupId>org.postgresql</groupId>
<artifactId>postgresql</artifactId>
<version>42.1.1.jre7</version>
</dependency>
<dependency>
<groupId>org.xerial</groupId>
<artifactId>sqlite-jdbc</artifactId>
<version>3.8.11.1</version>
</dependency>
<dependency>
<groupId>trove</groupId>
<artifactId>trove</artifactId>
<version>2.0.4</version>
</dependency>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
<version>1.2.16</version>
</dependency>
</dependencies>
<repositories>
<!-- for JWNL -->
<repository>
<id>opennlp.sourceforge.net</id>
<url>http://opennlp.sourceforge.net/maven2</url>
</repository>
<!-- for Trove -->
<repository>
<id>maven.ontotext.com</id>
<url>http://maven.ontotext.com/content/repositories/public/</url>
</repository>
<!-- for MTJ -->
<repository>
<id>repo.scalanlp.org</id>
<url>http://repo.scalanlp.org/repo</url>
</repository>
<repository>
<id>xerial</id>
<url>http://repo1.maven.org/maven2/org/xerial</url>
</repository>


</repositories>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
</properties>
</project>
16 changes: 16 additions & 0 deletions Testoutput/MVTopicModel-1.0-SNAPSHOT/config.properties
@@ -0,0 +1,16 @@
# To change this license header, choose License Headers in Project Properties.
# To change this template file, choose Tools | Templates
# and open the template in the editor.
TopicsNumber=400
Iterations=400
TopWords = 20
NumModalities=3
NumOfThreads=4
NumOfChars=4000
BurnIn=50
OptimizeInterval=50
PruneCnt=800
PruneLblCnt=80
PruneMaxPerc=1;
PruneMinPerc=0.1;
SQLConnectionString = jdbc:postgresql://localhost:5432/Tender?user=postgres&password=postgres&ssl=false
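
As a hedged illustration of how a file like this could be read, the sketch below uses `java.util.Properties`. The class `ConfigSketch`, its helper methods, and the default values are hypothetical; the project's own loader is not shown in this commit view. Note that `Properties` keeps everything after the separator as the value, so an entry written as `1;` would not parse as a number unless the trailing semicolon is stripped, which the helper does defensively.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class ConfigSketch {
    /** Parses properties text; in practice this would come from a file reader. */
    public static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    /** Trims whitespace and drops one trailing ';' if present. */
    static String clean(String raw) {
        String v = raw.trim();
        return v.endsWith(";") ? v.substring(0, v.length() - 1).trim() : v;
    }

    /** Reads an integer setting, falling back to def when the key is absent. */
    public static int getInt(Properties p, String key, int def) {
        String raw = p.getProperty(key);
        return raw == null ? def : Integer.parseInt(clean(raw));
    }

    /** Reads a floating-point setting, falling back to def when the key is absent. */
    public static double getDouble(Properties p, String key, double def) {
        String raw = p.getProperty(key);
        return raw == null ? def : Double.parseDouble(clean(raw));
    }

    public static void main(String[] args) {
        Properties p = parse("TopicsNumber=400\nTopWords = 20\nPruneMinPerc=0.1;\n");
        System.out.println("topics = " + getInt(p, "TopicsNumber", 100));
        System.out.println("top words = " + getInt(p, "TopWords", 10));
        System.out.println("prune min = " + getDouble(p, "PruneMinPerc", 0.0));
    }
}
```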
48 changes: 48 additions & 0 deletions Testoutput/MVTopicModel-1.0-SNAPSHOT/log4j.properties
@@ -0,0 +1,48 @@
#------------------------------------------------------------------------------
#
# The following properties set the logging levels and log appender. The
# log4j.rootCategory variable defines the default log level and one or more
# appenders. For the console, use 'S'. For the daily rolling file, use 'R'.
# For an HTML formatted log, use 'H'.
#
# To override the default (rootCategory) log level, define a property of the
# form (see below for available values):
#
# log4j.logger.<logger name> = <log level>
#
# Available logger names:
# TODO
#
# Possible Log Levels:
# FATAL, ERROR, WARN, INFO, DEBUG
#
#------------------------------------------------------------------------------
log4j.rootCategory=INFO, S, R

#log4j.logger.com.dappit.Dapper.parser=ERROR
#log4j.logger.org.w3c.tidy=FATAL

#------------------------------------------------------------------------------
#
# The following properties configure the console (stdout) appender.
# See http://logging.apache.org/log4j/docs/api/index.html for details.
#
#------------------------------------------------------------------------------
log4j.appender.S = org.apache.log4j.ConsoleAppender
log4j.appender.S.layout = org.apache.log4j.PatternLayout
log4j.appender.S.layout.ConversionPattern = %d{yyyy-MM-dd HH:mm:ss} %c{1} [%p] %m%n

#------------------------------------------------------------------------------
#
# The following properties configure the Daily Rolling File appender.
# See http://logging.apache.org/log4j/docs/api/index.html for details.
#
#------------------------------------------------------------------------------
log4j.appender.R = org.apache.log4j.DailyRollingFileAppender
log4j.appender.R.File = logs/MVTopicModelling.log
log4j.appender.R.Append = true
log4j.appender.R.DatePattern = '.'yyy-MM-dd
log4j.appender.R.layout = org.apache.log4j.PatternLayout
log4j.appender.R.layout.ConversionPattern = %d{yyyy-MM-dd HH:mm:ss} %c{1} [%p] %m%n

