
Dualsum #50

Merged · 42 commits · Nov 30, 2023
dc81d44
Try to isolate Breeze
kwalcock Nov 8, 2023
c777301
WIP
kwalcock Nov 8, 2023
6018fdf
Rename to Math, transpose
kwalcock Nov 8, 2023
eab249a
Complete Breeze isolation
kwalcock Nov 9, 2023
97e3d2b
Fix it again
kwalcock Nov 9, 2023
2ab8e60
Try Emjl math
kwalcock Nov 2, 2023
2f91702
Test BreezeMath
kwalcock Nov 9, 2023
f038c40
Unit test Emjl math
kwalcock Nov 9, 2023
92c269e
Organize math better
kwalcock Nov 9, 2023
42b2679
Clean up types
kwalcock Nov 9, 2023
c3851f4
Fix up test app
kwalcock Nov 9, 2023
e50c05d
Test better
kwalcock Nov 9, 2023
b21c0a6
Make column matrix directly
kwalcock Nov 10, 2023
0d8a1ed
Make column matrix directly
kwalcock Nov 10, 2023
8148de7
Remove OnnxMath
kwalcock Nov 10, 2023
fd29280
Rename things
kwalcock Nov 10, 2023
b7c2326
Move the BlasInstanceApp to apps subproject
kwalcock Nov 10, 2023
fb0d70d
Only include necessary ejml parts
kwalcock Nov 10, 2023
cddb890
Add CommonsMath
kwalcock Nov 10, 2023
79084ca
Add some mkRowVector to tests
kwalcock Nov 10, 2023
49ac525
Add some logging
kwalcock Nov 11, 2023
f16b818
use sum instead of concat in dual mode
MihaiSurdeanu Nov 13, 2023
a75b72c
Add CluMath
kwalcock Nov 14, 2023
72d210f
Remove .t from main interface
kwalcock Nov 14, 2023
ff7c5d7
Write some comment on CluMath
kwalcock Nov 14, 2023
f3b5362
Hide math files that aren't used
kwalcock Nov 14, 2023
2721d6e
sum not concat in scala
MihaiSurdeanu Nov 15, 2023
21c0f34
use sum instead of concat everywhere.
MihaiSurdeanu Nov 16, 2023
a41971e
Merge branch 'dualsum' into kwalcock/mathWithDualSum
kwalcock Nov 16, 2023
aa39eb9
Do the sum
kwalcock Nov 16, 2023
c095106
Simplify
kwalcock Nov 16, 2023
21cf682
Remove log
kwalcock Nov 16, 2023
b3d901d
Format
kwalcock Nov 16, 2023
32e07af
deberta full model
MihaiSurdeanu Nov 22, 2023
8bc4471
Merge branch 'dualsum' into kwalcock/mathWithDualSum
kwalcock Nov 29, 2023
f207514
Clean up the apps
kwalcock Nov 29, 2023
635bfa7
Remove breeze and go with ejml
kwalcock Nov 29, 2023
9c7463f
Add documentation
kwalcock Nov 29, 2023
cbc99d7
Shade EJML
kwalcock Nov 29, 2023
e7e417e
Shade the other class as well
kwalcock Nov 30, 2023
70f44f2
Update CHANGES
kwalcock Nov 30, 2023
09218f7
Merge pull request #51 from clulab/kwalcock/mathWithDualSum
kwalcock Nov 30, 2023
4 changes: 4 additions & 0 deletions CHANGES.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,7 @@
+ **0.6.2** - Shade additional package internal to EJML
+ **0.6.1** - Shade EJML
+ **0.6.0** - Calculate with EJML rather than Breeze
+ **0.6.0** - Use sum instead of concat
+ **0.5.0** - Support Linux on aarch64
+ **0.5.0** - Isolate dependencies on models to the apps subproject
+ **0.4.0** - Account for maxTokens
5 changes: 3 additions & 2 deletions apps/build.sbt
@@ -7,8 +7,9 @@ resolvers ++= Seq(

libraryDependencies ++= {
Seq(
"org.clulab" % "roberta-onnx-model" % "0.1.0",
"org.clulab" % "deberta-onnx-model" % "0.1.0",
"org.clulab" % "deberta-onnx-model" % "0.2.0",
"org.clulab" % "electra-onnx-model" % "0.2.0",
"org.clulab" % "roberta-onnx-model" % "0.2.0",
"org.scalatest" %% "scalatest" % "3.2.15" % "test"
)
}
@@ -1,4 +1,4 @@
package org.clulab.scala_transformers.encoder.apps
package org.clulab.scala_transformers.apps

import dev.ludovic.netlib.blas.{BLAS, JavaBLAS, NativeBLAS}

@@ -5,7 +5,12 @@ import org.clulab.scala_transformers.tokenizer.LongTokenization
import org.clulab.scala_transformers.tokenizer.jni.ScalaJniTokenizer

object LoadExampleFromFileApp extends App {
val baseName = args.lift(0).getOrElse("../tcmodel")
// Choose one of these.
val defaultBaseName = "../models/microsoft_deberta_v3_base_mtl/avg_export"
// val defaultBaseName = "../models/google_electra_small_discriminator_mtl/avg_export"
// val defaultBaseName = "../models/roberta_base_mtl/avg_export"

val baseName = args.lift(0).getOrElse(defaultBaseName)
val tokenClassifierLayout = new TokenClassifierLayout(baseName)
val tokenClassifierFactory = new TokenClassifierFactoryFromFiles(tokenClassifierLayout)
val words = Array("EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", ".")
@@ -5,7 +5,11 @@ import org.clulab.scala_transformers.tokenizer.LongTokenization
import org.clulab.scala_transformers.tokenizer.jni.ScalaJniTokenizer

object LoadExampleFromResourceApp extends App {
val baseName = "/org/clulab/scala_transformers/models/roberta_base_mtl/avg_export"
// Choose one of these.
val baseName = "/org/clulab/scala_transformers/models/microsoft_deberta_v3_base_mtl/avg_export"
// val baseName = "/org/clulab/scala_transformers/models/google_electra_small_discriminator_mtl/avg_export"
// val baseName = "/org/clulab/scala_transformers/models/roberta_base_mtl/avg_export"

val tokenClassifierLayout = new TokenClassifierLayout(baseName)
val tokenClassifierFactory = new TokenClassifierFactoryFromResources(tokenClassifierLayout)
val words = Array("EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", ".")
@@ -2,17 +2,14 @@ package org.clulab.scala_transformers.apps

import org.clulab.scala_transformers.encoder.TokenClassifier

/*
import java.io.File

import org.clulab.scala_transformers.tokenizer.jni.ScalaJniTokenizer
import org.clulab.scala_transformers.tokenizer.LongTokenization
*/

object TokenClassifierExampleApp extends App {
//val tokenClassifier = TokenClassifier.fromFiles("../../scala-transformers-models/roberta-base-mtl/avg_export")
val tokenClassifier = TokenClassifier.fromFiles("../microsoft-deberta-v3-base-mtl/avg_export")
//val tokenClassifier = TokenClassifier.fromResources("/org/clulab/scala_transformers/models/microsoft_deberta_v3_base_mtl/avg_export")
// Choose one of these.
val tokenClassifier = TokenClassifier.fromFiles("../models/microsoft_deberta_v3_base_mtl/avg_export")
// val tokenClassifier = TokenClassifier.fromResources("/org/clulab/scala_transformers/models/microsoft_deberta_v3_base_mtl/avg_export")
// val tokenClassifier = TokenClassifier.fromFiles("../models/google_electra_small_discriminator_mtl/avg_export")
// val tokenClassifier = TokenClassifier.fromResources("/org/clulab/scala_transformers/models/google_electra_small_discriminator_mtl/avg_export")
// val tokenClassifier = TokenClassifier.fromFiles("../models/roberta_base_mtl/avg_export")
// val tokenClassifier = TokenClassifier.fromResources("/org/clulab/scala_transformers/models/roberta_base_mtl/avg_export")

//val words = Seq("EU", "rejects", "German", "call", "to", "boycott", "British", "lamb", ".")
val words = Seq("John", "Doe", "went", "to", "China", ".")
@@ -0,0 +1,130 @@
package org.clulab.scala_transformers.apps

import org.clulab.scala_transformers.common.Timers
import org.clulab.scala_transformers.encoder.{EncoderMaxTokensRuntimeException, TokenClassifier}
import org.clulab.scala_transformers.tokenizer.LongTokenization

import scala.io.Source

object TokenClassifierTimerApp extends App {

class TimedTokenClassifier(tokenClassifier: TokenClassifier) extends TokenClassifier(
tokenClassifier.encoder, tokenClassifier.maxTokens, tokenClassifier.tasks, tokenClassifier.tokenizer
) {
val tokenizeTimer = Timers.getOrNew("Tokenizer")
val forwardTimer = Timers.getOrNew("Encoder.forward")
val predictTimers = tokenClassifier.tasks.indices.map { index =>
val name = tasks(index).name

Timers.getOrNew(s"Encoder.predict $index\t$name")
}

// NOTE: This should be copied from the base class and then instrumented with timers.
override def predictWithScores(words: Seq[String], headTaskName: String = "Deps Head"): Array[Array[Array[(String, Float)]]] = {
// This condition must be met in order for allLabels to be filled properly without nulls.
// The condition is not checked at runtime!
// if (tasks.exists(_.dual))
// require(tasks.count(task => !task.dual && task.name == headTaskName) == 1)

// tokenize to subword tokens
val tokenization = tokenizeTimer.time {
LongTokenization(tokenizer.tokenize(words.toArray))
}
val inputIds = tokenization.tokenIds
val wordIds = tokenization.wordIds
val tokens = tokenization.tokens

if (inputIds.length > maxTokens) {
throw new EncoderMaxTokensRuntimeException(s"Encoder error: the following text contains more tokens than the maximum number accepted by this encoder ($maxTokens): ${tokens.mkString(", ")}")
}

// run the sentence through the transformer encoder
val encOutput = forwardTimer.time {
encoder.forward(inputIds)
}

// outputs for all tasks stored here: task x tokens in sentence x scores per token
val allLabels = new Array[Array[Array[(String, Float)]]](tasks.length)
// all heads predicted for every token
// dimensions: token x heads
var heads: Option[Array[Array[Int]]] = None

// now generate token label predictions for all primary tasks (not dual!)
for (i <- tasks.indices) {
if (!tasks(i).dual) {
val tokenLabels = predictTimers(i).time {
tasks(i).predictWithScores(encOutput, None, None)
}
val wordLabels = TokenClassifier.mapTokenLabelsAndScoresToWords(tokenLabels, tokenization.wordIds)
allLabels(i) = wordLabels

// if this is the task that predicts head positions, then save them for the dual tasks
// we save all the heads predicted for each token
if (tasks(i).name == headTaskName) {
heads = Some(tokenLabels.map(_.map(_._1.toInt)))
}
}
}

// generate outputs for the dual tasks, if heads were predicted by one of the primary tasks
// the dual task(s) must be aligned with the heads.
// that is, we predict the top label for each of the head candidates
if (heads.isDefined) {
//println("Tokens: " + tokens.mkString(", "))
//println("Heads:\n\t" + heads.get.map(_.slice(0, 3).mkString(", ")).mkString("\n\t"))
//println("Masks: " + TokenClassifier.mkTokenMask(wordIds).mkString(", "))
val masks = Some(TokenClassifier.mkTokenMask(wordIds))

for (i <- tasks.indices) {
if (tasks(i).dual) {
val tokenLabels = predictTimers(i).time {
tasks(i).predictWithScores(encOutput, heads, masks)
}
val wordLabels = TokenClassifier.mapTokenLabelsAndScoresToWords(tokenLabels, tokenization.wordIds)
allLabels(i) = wordLabels
}
}
}

allLabels
}
}

val verbose = false
val fileName = args.lift(0).getOrElse("../corpora/sentences/sentences.txt")
// Choose one of these.
val untimedTokenClassifier = TokenClassifier.fromFiles("../models/microsoft_deberta_v3_base_mtl/avg_export")
// val untimedTokenClassifier = TokenClassifier.fromFiles("../models/google_electra_small_discriminator_mtl/avg_export")
// val untimedTokenClassifier = TokenClassifier.fromFiles("../models/roberta_base_mtl/avg_export")

val tokenClassifier = new TimedTokenClassifier(untimedTokenClassifier)
val lines = {
val source = Source.fromFile(fileName)
val lines = source.getLines().take(100).toArray

source.close
lines
}
val elapsedTimer = Timers.getOrNew("Elapsed")

elapsedTimer.time {
lines.zipWithIndex/*.par*/.foreach { case (line, index) =>
println(s"$index $line")
if (index != 1382) {
val words = line.split(" ").toSeq
val allLabelSeqs = tokenClassifier.predictWithScores(words)

if (verbose) {
println(s"Words: ${words.mkString(", ")}")
for (layer <- allLabelSeqs) {
val words = layer.map(_.head) // Collapse the next layer by just taking the head.
val wordLabels = words.map(_._1)

println(s"Labels: ${wordLabels.mkString(", ")}")
}
}
}
}
}
Timers.summarize()
}
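The `Timers` API used by the app above (`getOrNew`, `time { ... }`, `summarize`) can be sketched as a registry of accumulating named timers. This Python sketch only mirrors the Scala call names for illustration; the actual `org.clulab.scala_transformers.common.Timers` implementation is not shown in this diff, so the details here are assumptions.

```python
import time

class Timer:
    """Accumulates wall-clock time and call counts for one named stage."""
    def __init__(self, name):
        self.name = name
        self.elapsed = 0.0
        self.count = 0

    def time(self, thunk):
        # Run thunk(), add its elapsed time to the running total, return its result.
        start = time.perf_counter()
        result = thunk()
        self.elapsed += time.perf_counter() - start
        self.count += 1
        return result

class Timers:
    _registry = {}

    @classmethod
    def get_or_new(cls, name):
        # Return the existing timer for this name, or register a new one.
        return cls._registry.setdefault(name, Timer(name))

    @classmethod
    def summarize(cls):
        for timer in cls._registry.values():
            print(f"{timer.name}\tcalls={timer.count}\telapsed={timer.elapsed:.3f}s")

# Usage mirroring the app: time each stage, then print a summary at the end.
tokenize_timer = Timers.get_or_new("Tokenizer")
tokens = tokenize_timer.time(lambda: "EU rejects German call".split(" "))
Timers.summarize()
```

Because `get_or_new` hands back the same timer for the same name, times from repeated calls across the loop accumulate into one per-stage total, which is what the summary at the end reports.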
@@ -1,4 +1,4 @@
package org.clulab.scala_transformers.encoder.timer
package org.clulab.scala_transformers.common

import scala.collection.mutable.{HashMap => MutableHashMap}

21 changes: 20 additions & 1 deletion encoder/build.sbt
@@ -10,9 +10,16 @@ libraryDependencies ++= {
case Some((2, minor)) if minor < 12 => "1.0"
case _ => "2.1.0"
}
val ejmlVersion = "0.41" // Use this older version for Java 8.

Seq(
"org.scalanlp" %% "breeze" % breezeVersion,
// Choose one of these.
/// "org.apache.commons" % "commons-math3" % "3.6.1",
"org.ejml" % "ejml-core" % ejmlVersion,
"org.ejml" % "ejml-fdense" % ejmlVersion,
"org.ejml" % "ejml-simple" % ejmlVersion,
// "org.scalanlp" %% "breeze" % breezeVersion,

"com.microsoft.onnxruntime" % "onnxruntime" % "1.13.1",
"org.slf4j" % "slf4j-api" % "1.7.10"
)
@@ -21,3 +28,15 @@ libraryDependencies ++= {
fork := true

// assembly / mainClass := Some("com.keithalcock.tokenizer.scalapy.apps.ExampleApp")

enablePlugins(ShadingPlugin)
shadedDependencies ++= Set(
"org.ejml" % "ejml-core" % "<ignored>",
"org.ejml" % "ejml-fdense" % "<ignored>",
"org.ejml" % "ejml-simple" % "<ignored>"
)
shadingRules ++= Seq(
ShadingRule.moveUnder("org.ejml", "org.clulab.shaded"),
ShadingRule.moveUnder("pabeles.concurrency", "org.clulab.shaded")
)
validNamespaces ++= Set("org", "org.clulab")
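The two `moveUnder` rules relocate every class under `org.ejml` and `pabeles.concurrency` into the `org.clulab.shaded` namespace of the published jar, so the shaded EJML cannot clash with another EJML on a user's classpath. A rough sketch of what the rename does to fully qualified names (`move_under` here is a hypothetical helper imitating the rule, not the sbt plugin's real API):

```python
def move_under(prefix: str, new_parent: str, fqcn: str) -> str:
    # Relocate a fully qualified class name under new_parent, e.g.
    # org.ejml.simple.SimpleMatrix -> org.clulab.shaded.org.ejml.simple.SimpleMatrix
    if fqcn == prefix or fqcn.startswith(prefix + "."):
        return f"{new_parent}.{fqcn}"
    return fqcn

renamed = move_under("org.ejml", "org.clulab.shaded", "org.ejml.simple.SimpleMatrix")
print(renamed)  # org.clulab.shaded.org.ejml.simple.SimpleMatrix

# Classes outside the shaded prefixes are left untouched.
untouched = move_under("org.ejml", "org.clulab.shaded", "org.slf4j.Logger")
print(untouched)  # org.slf4j.Logger
```

The second rule exists because EJML's concurrency helpers live in the top-level `pabeles.concurrency` package rather than under `org.ejml`, so shading `org.ejml` alone would leave those classes unshaded (the "Shade the other class as well" commit).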
4 changes: 2 additions & 2 deletions encoder/src/main/python/averaging_trainer.py
@@ -126,7 +126,7 @@ def print_some_params(self, model: TokenClassificationModel, msg: str) -> None:
ShortTaskDef("Chunking", "chunking/", "train.txt", "test.txt", "test.txt"),
#ShortTaskDef("Deps Head", "deps-wsj/", "train.heads", "dev.heads", "test.heads"),
#ShortTaskDef("Deps Label", "deps-wsj/", "train.labels", "dev.labels", "test.labels", dual_mode=True)
ShortTaskDef("Deps Head", "deps-combined/", "wsjtrain-wsjdev-geniatrain-geniadev.heads", "dev.heads", "test.heads"),
ShortTaskDef("Deps Label", "deps-combined/", "wsjtrain-wsjdev-geniatrain-geniadev.labels", "dev.labels", "test.labels", dual_mode=True)
ShortTaskDef("Deps Head", "deps-combined/", "wsjtrain-wsjdev-geniatrain-geniadev.heads", "test.heads", "test.heads"),
ShortTaskDef("Deps Label", "deps-combined/", "wsjtrain-wsjdev-geniatrain-geniadev.labels", "test.labels", "test.labels", dual_mode=True)
])
AveragingTrainer(tokenizer).train(tasks)
4 changes: 2 additions & 2 deletions encoder/src/main/python/clu_trainer.py
@@ -90,8 +90,8 @@ def compute_metrics(self, eval_pred: EvalPrediction) -> Dict[str, float]:
ShortTaskDef("NER", "conll-ner/", "train.txt", "dev.txt", "test.txt"),
ShortTaskDef("POS", "pos/", "train.txt", "dev.txt", "test.txt"),
ShortTaskDef("Chunking", "chunking/", "train.txt", "test.txt", "test.txt"), # this dataset has no dev
ShortTaskDef("Deps Head", "deps-combined/", "wsjtrain-wsjdev-geniatrain-geniadev.heads", "dev.heads", "test.heads"),
ShortTaskDef("Deps Label", "deps-combined/", "wsjtrain-wsjdev-geniatrain-geniadev.labels", "dev.labels", "test.labels", dual_mode=True)
ShortTaskDef("Deps Head", "deps-combined/", "wsjtrain-wsjdev-geniatrain-geniadev.heads", "test.heads", "test.heads"), # dev is included in train
ShortTaskDef("Deps Label", "deps-combined/", "wsjtrain-wsjdev-geniatrain-geniadev.labels", "test.labels", "test.labels", dual_mode=True) # dev is included in train
#ShortTaskDef("Deps Head", "deps-wsj/", "train.heads", "dev.heads", "test.heads"),
#ShortTaskDef("Deps Label", "deps-wsj/", "train.labels", "dev.labels", "test.labels", dual_mode=True)
])
7 changes: 5 additions & 2 deletions encoder/src/main/python/token_classifier.py
@@ -223,7 +223,7 @@ def __init__(self, hidden_size: int, num_labels: int, task_id, dual_mode: bool=F
self.dropout = nn.Dropout(dropout_p)
self.dual_mode = dual_mode
self.classifier = nn.Linear(
hidden_size if not self.dual_mode else hidden_size * 2,
hidden_size, # if not self.dual_mode else hidden_size * 2, # USE SUM
num_labels
)
self.num_labels = num_labels
@@ -248,8 +248,11 @@ def concatenate(self, sequence_output, head_positions):
head_states = sequence_output[torch.arange(sequence_output.shape[0]).unsqueeze(1), long_head_positions]
#print(f"head_states.size = {head_states.size()}")
# Concatenate the hidden states from modifier + head.
modifier_head_states = torch.cat([sequence_output, head_states], dim=2)
#modifier_head_states = torch.cat([sequence_output, head_states], dim=2)
modifier_head_states = torch.add(sequence_output, head_states) # USE SUM
#print(f"modifier_head_states.size = {modifier_head_states.size()}")
#print("EXIT")
#exit(1)
return modifier_head_states

def forward(self, sequence_output, pooled_output, head_positions, labels=None, attention_mask=None, **kwargs):
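The switch from `torch.cat` to `torch.add` is also why the classifier's input size above drops back to `hidden_size`: concatenating the modifier and head states doubles the feature dimension, while an element-wise sum preserves it. A minimal pure-Python sketch of the shape difference (toy values, no PyTorch; sizes are illustrative only):

```python
hidden_size = 4
sequence_output = [[1.0] * hidden_size for _ in range(3)]  # one state per token
head_states     = [[2.0] * hidden_size for _ in range(3)]  # state of each token's head

# Old behavior: concatenation doubles the feature dimension,
# so the classifier needed an input of hidden_size * 2 in dual mode.
concatenated = [m + h for m, h in zip(sequence_output, head_states)]
print(len(concatenated[0]))  # 8

# New behavior: element-wise sum keeps the feature dimension at hidden_size,
# so the classifier's input size no longer depends on dual mode.
summed = [[mi + hi for mi, hi in zip(m, h)]
          for m, h in zip(sequence_output, head_states)]
print(len(summed[0]))  # 4
```

This mirrors the Scala-side commits ("use sum instead of concat everywhere"): both the Python trainer and the Scala inference code must agree on the fused representation, or the exported classifier weights would not match the runtime input width.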
@@ -1,10 +1,10 @@
package org.clulab.scala_transformers.encoder

import breeze.linalg._
import BreezeUtils._
import breeze.linalg._
import org.clulab.scala_transformers.encoder.math.BreezeMath

object BreezeExamples extends App {
val m = mkRowMatrix[Float](Array(Array(1f, 2f), Array(3f, 4f)))
val m = BreezeMath.mkMatrixFromRows(Array(Array(1f, 2f), Array(3f, 4f)))
println(m)

println("Row 0: " + m(0, ::))

This file was deleted.

@@ -1,7 +1,7 @@
package org.clulab.scala_transformers.encoder

import ai.onnxruntime.{OnnxTensor, OrtEnvironment, OrtSession}
import breeze.linalg.DenseMatrix
import org.clulab.scala_transformers.encoder.math.Mathematics.{Math, MathMatrix}

import java.io.DataInputStream
import java.util.{HashMap => JHashMap}
@@ -13,12 +13,12 @@ class Encoder(val encoderEnvironment: OrtEnvironment, val encoderSession: OrtSes
* @param batchInputIds First dimension is batch size (1 for a single sentence); second is sentence size
* @return Hidden states for the whole batch. The matrix dimension: rows = sentence size; columns = hidden state size
*/
def forward(batchInputIds: Array[Array[Long]]): Array[DenseMatrix[Float]] = {
def forward(batchInputIds: Array[Array[Long]]): Array[MathMatrix] = {
val inputs = new JHashMap[String, OnnxTensor]()
inputs.put("token_ids", OnnxTensor.createTensor(encoderEnvironment, batchInputIds))

val encoderOutput = encoderSession.run(inputs).get(0).getValue.asInstanceOf[Array[Array[Array[Float]]]]
val outputs = encoderOutput.map(BreezeUtils.mkRowMatrix(_))
val result: OrtSession.Result = encoderSession.run(inputs)
val outputs = Math.fromResult(result)
outputs
}

@@ -27,7 +27,7 @@ class Encoder(val encoderEnvironment: OrtEnvironment, val encoderSession: OrtSes
* @param inputIds Array of token ids for this sentence
* @return Hidden states for this sentence. The matrix dimension: rows = sentence size; columns = hidden state size
*/
def forward(inputIds: Array[Long]): DenseMatrix[Float] = {
def forward(inputIds: Array[Long]): MathMatrix = {
val batchInputIds = Array(inputIds)
forward(batchInputIds).head
}