-
-
Notifications
You must be signed in to change notification settings - Fork 70
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose chunkSize parameter in CallDuplexConsensusReads #867
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -28,6 +28,7 @@ import com.fulcrumgenomics.FgBioDef._ | |
import com.fulcrumgenomics.bam.OverlappingBasesConsensusCaller | ||
import com.fulcrumgenomics.bam.api.{SamOrder, SamSource, SamWriter} | ||
import com.fulcrumgenomics.cmdline.{ClpGroups, FgBioTool} | ||
import com.fulcrumgenomics.commons.collection.ParIterator.DefaultChunkSize | ||
import com.fulcrumgenomics.sopt.clp | ||
import com.fulcrumgenomics.commons.io.Io | ||
import com.fulcrumgenomics.commons.util.LazyLogging | ||
|
@@ -108,6 +109,12 @@ class CallDuplexConsensusReads | |
val maxReadsPerStrand: Option[Int] = None, | ||
@arg(doc="The number of threads to use while consensus calling.") val threads: Int = 1, | ||
@arg(doc="Consensus call overlapping bases in mapped paired end reads") val consensusCallOverlappingBases: Boolean = true, | ||
@arg(doc=""" | ||
|Pull reads from this many source molecules into memory for multi-threaded processing. | ||
|Using a smaller value will require less memory but will negatively impact processing speed. | ||
|For very large family sizes, a smaller value may be necessary to reduce memory usage. | ||
|This value is only used when `--threads > 1`. | ||
""") val maxSourceMoleculesInMemory: Int = DefaultChunkSize | ||
mjhipp marked this conversation as resolved.
Show resolved
Hide resolved
|
||
) extends FgBioTool with LazyLogging { | ||
|
||
Io.assertReadable(input) | ||
|
@@ -142,7 +149,7 @@ class CallDuplexConsensusReads | |
maxReadsPerStrand = maxReadsPerStrand.getOrElse(VanillaUmiConsensusCallerOptions.DefaultMaxReads) | ||
) | ||
val progress = ProgressLogger(logger, unit=1000000) | ||
val iterator = new ConsensusCallingIterator(inIter, caller, Some(progress), threads) | ||
val iterator = new ConsensusCallingIterator(inIter, caller, Some(progress), threads, maxSourceMoleculesInMemory) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you please add the same to |
||
out ++= iterator | ||
progress.logLast() | ||
|
||
|
Original file line number | Diff line number | Diff line change | ||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -40,7 +40,7 @@ import com.fulcrumgenomics.util.ProgressLogger | |||||||||||||
* @param caller the consensus caller to use to call consensus reads | ||||||||||||||
* @param progress an optional progress logger to which to log progress in input reads | ||||||||||||||
* @param threads the number of threads to use. | ||||||||||||||
* @param chunkSize parallel process in chunkSize units; will cause 8 * chunkSize records to be held in memory | ||||||||||||||
* @param chunkSize across the input [[SamRecord]]s from this many source molecules at a time | ||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The new param doc doesn't track for me. Based on my comment below:
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If this is true, then presumably in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @tfenne @nh13 Many of these comments, here and elsewhere in the review, go against my interpretation of how Only one chunk is pulled into memory at a time at the "front" of the The It was based on this interpretation that I do not divide by |
||||||||||||||
*/ | ||||||||||||||
class ConsensusCallingIterator[ConsensusRead <: SimpleRead](sourceIterator: Iterator[SamRecord], | ||||||||||||||
caller: UmiConsensusCaller[ConsensusRead], | ||||||||||||||
|
@@ -67,10 +67,11 @@ class ConsensusCallingIterator[ConsensusRead <: SimpleRead](sourceIterator: Iter | |||||||||||||
groupingIterator.flatMap(caller.consensusReadsFromSamRecords) | ||||||||||||||
} | ||||||||||||||
else { | ||||||||||||||
ParIterator(groupingIterator, threads=threads).flatMap { rs => | ||||||||||||||
ParIterator(groupingIterator, threads=threads, chunkSize=chunkSize).map { rs => | ||||||||||||||
val caller = callers.get() | ||||||||||||||
caller.synchronized { caller.consensusReadsFromSamRecords(rs) } | ||||||||||||||
}.toAsync(chunkSize * 8) | ||||||||||||||
}.toAsync(chunkSize).flatten | ||||||||||||||
mjhipp marked this conversation as resolved.
Show resolved
Hide resolved
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I am confused with why the chunk size passed to
Suggested change
|
||||||||||||||
// Flatten AFTER pulling through ParIterator to keep input chunks in phase with output | ||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If I understand this correctly, we previously got chunks out of order in some cases?
Suggested change
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The concern is not about records out of order, but about the number of chunks that are getting pulled through the iterator at a time. Clint's diagram sums it up better I think In this instance, I am referring to a "source molecule" as all records within a group output by The goal of the change in flattening is that the same number of source molecules (not SAM records) are pulled through the end as the beginning. There will be many fewer SAM records after flattening, since we go from If we flatten before, and use |
||||||||||||||
} | ||||||||||||||
} | ||||||||||||||
|
||||||||||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not strictly true, since the
ParIterator
will havethreads * maxSourceMoleculesInMemory
, and so will theAsyncIterator
that wraps it, so it should be2 * threads * maxSourceMoleculesInMemory
I think.