Most recent releases are shown at the top. Each release shows:
- New: New classes, methods, functions, etc
- Changed: Additional parameters, changes to inputs or outputs, etc
- Fixed: Bug fixes that don't change documented behaviour
- N/A
- N/A
- Remove references to `paper-qa` (#530)
- Reduce memory footprint of `TopicModel.filter` (#531)
- N/A
- N/A
- Removed `tf_keras` as a dependency due to issues in various dependencies related to TF 2.16 and allow TF to prompt user for it (#528)
- Removed auto-setting of `TF_USE_LEGACY_KERAS`, as it causes problems in `tensorflow<2.16` (#528)
- Unpin `transformers` due to incompatibilities with different versions of TensorFlow.
- N/A
- N/A
- Added `tf_keras` to dependencies and set `TF_USE_LEGACY_KERAS` (#525)
- N/A
- N/A
- temporarily pinning to `transformers==4.37.2` due to issue (#523) on Google Colab
- N/A
- Breaking Change: Removed the `ktrain.text.qa.generative_qa` module. Users should use our OnPrem.LLM for generative question-answering (#522)
- use arrays in `TextPredictor` due to possible issues with `tf.Dataset` (#521)
- N/A
- Changed `shallownlp.classifier` API with respect to hyperparameters and defaults
- Ensure weight files in checkpoint folder have `val_loss` in file name (#519)
- N/A
- Changes to custom `eli5` and `stellargraph` to support Python 3.11 (#515)
- Switch from unmaintained `cchardet` to `charset-normalizer` (#512)
- Use `textract-py3` instead of `textract` (#511)
- N/A
- Breaking Change: The `generative_ai.LLM` class replaces `generative_ai.GenerativeAI` and is now powered by our OnPrem.LLM package (see example notebook). `GenerativeQA` now recommends `langchain==0.0.240`
- N/A
- N/A
- N/A
- Removed pin to `paper-qa==2.1.1` due to issue in latest `langchain` release. Added notification to install `langchain==0.0.180`
- N/A
- N/A
- Removed pin on `scikit-learn`, as `eli5-tf` repo was updated to support `scikit-learn>=1.3` (#505)
- pin to `paper-qa==2.1.1` due to breaking changes (#506)
- N/A
- N/A
- Temporarily pin to `scikit-learn<1.3` to avoid `eli5` import error (#505)
- Temporarily changed `generative_qa` imports to avoid `OPENAI_API_KEY` error (#506)
- N/A
- N/A
- fix `eda.py` topic visualization to work with `bokeh>=3.0.0` (#504)
- N/A
- `text.models`, `vision.models`, and `tabular.models` now all automatically set metrics to use `binary_accuracy` for multilabel problems
- fix `validate` to support multilabel classification problems (#498)
- add a warning to `TransformerPreprocessor.get_classifier` to use `binary_accuracy` for multilabel problems (#498)
- Supply arguments to `generate` in `TransformerSummarizer.summarize`
- N/A
- N/A
- Support for Generative Question-Answering powered by OpenAI models, LangChain, and PaperQA. Ask questions to any set of documents and get back answers with citations to where the answer was found in your documents.
- N/A
- N/A
- N/A
- N/A
- resolved issue with using DeBERTa embedding models with NER (#492)
- easy-to-use wrapper for sentiment analysis
- N/A
- N/A
- N/A
- N/A
- Ensure `do_sample=True` for `GenerativeAI`
- Support for generative AI with few-shot and zero-shot prompting using a model that can run on your own machine.
- N/A
- N/A
- Support for LexRank summarization
- N/A
- Bug fix in `dataset` module (#486)
- N/A
- Added `verbose` parameter to `predict*` methods in all `Predictor` classes
- N/A
- N/A
- Added `exclude_unigrams` argument to `text.kw` module and support unigram extraction when `noun_phrases` is selected
- explicitly set `num_beams` and `early_stopping` for `generate` in `ktrain.text.translation.core` to prevent errors in `transformers>=4.26.0`
- N/A
- N/A
- fixed typo in `translation` module (#479)
- removed superfluous warning when inspecting `transformer` model signature
- N/A
- N/A
- Resolved bug that causes problems when loading PyTorch models (#478)
- Support for the latest version of `transformers`.
- Removed pin to `transformers==4.17`
- Changed `numpy.float` and `numpy.int` to `numpy.float64` and `numpy.int_`, respectively, in `ktrain.utils` (#474)
- Removed `pandas` deprecation warnings from `ktrain.tabular.preprocessor` (#475)
- Ensure `use_token_type_ids` always exists in `TransformerPreprocessor` objects to ensure backwards compatibility
- Removed reference to `networkx.info`, as it was removed in `networkx>=3`
- N/A
- N/A
- Changed NMF to accept optional parameters `nmf_alpha_W` and `nmf_alpha_H` based on changes in `scikit-learn==1.2.0`.
- Change `ktrain.utils` to check for TensorFlow before doing a version check, so that ktrain can be imported without TensorFlow being installed.
- N/A
- N/A
- In TensorFlow 2.11, the `tf.optimizers.Optimizer` base class points to the new Keras optimizer, which seems to have problems. Users should use the legacy optimizers in `tf.keras.optimizers.legacy` with ktrain (which evidently will never be deleted). This means that, in TF 2.11, supplying a string representation of an optimizer like `"adam"` to `model.compile` uses the new optimizer instead of the legacy optimizers. In these cases, ktrain will issue a warning and automatically recompile the model with the default `tf.keras.optimizers.legacy.Adam` optimizer.
- Support for TensorFlow 2.11. For now, as recommended in the TF release notes, ktrain has been changed to use the legacy optimizers in `tf.keras.optimizers.legacy`. This means that, when compiling Keras models, you should supply `tf.keras.optimizers.legacy.Adam()` instead of the string `"adam"`.
- Support for Python 3.10. Changed references from `CountVectorizer.get_feature_names` to `CountVectorizer.get_feature_names_out`. Updated supported versions in `setup.py`.
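A minimal sketch of the TF 2.11 recommendation above: compile with an explicit legacy optimizer instance rather than the string `"adam"`. The toy model is illustrative only.

```python
import tensorflow as tf

# Toy model; any Keras model used with ktrain follows the same pattern.
model = tf.keras.Sequential([tf.keras.layers.Dense(2, activation="softmax")])

# Under TF 2.11, prefer the legacy optimizer class over the "adam" string,
# which would otherwise resolve to the new optimizer implementation.
model.compile(
    optimizer=tf.keras.optimizers.legacy.Adam(learning_rate=1e-3),
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
```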
- N/A
- fixed error in docs
- N/A
- N/A
- Adjusted tika imports due to issue with `/tmp/tika.log` in multi-user scenario
- N/A
- N/A
- Adjustment for kwe
- Fixed problem with importing `ktrain` without TensorFlow installed
- N/A
- N/A
- Fixed paragraph tokenization in `AnswerExtractor`
- N/A
- re-arranged dep warnings for TF
- ktrain now pinned to `transformers==4.17.0`. Python 3.6 users can downgrade to `transformers==4.10.3` and still use ktrain.
- N/A
- N/A
- updated dependencies to work with newer versions (but temporarily continue pinning to `transformers==4.10.1`)
- fixes for newer `networkx`
- N/A
- N/A
- fix release
- N/A
- `TextPredictor.explain` and `ImagePredictor.explain` now use a different fork of `eli5`: `pip install https://github.com/amaiya/eli5-tf/archive/refs/heads/master.zip`
- Fixed `loss_fn_from_model` function to work with `DISABLE_V2_BEHAVIOR` properly
- `TextPredictor.explain` and `ImagePredictor.explain` now work with `tensorflow>=2.9` and `scipy>=1.9` (due to new `eli5-tf` fork -- see above)
- N/A
- added `alnum` check and period check to `KeywordExtractor`
- fixed bug in `text.qa.core` caused by previous refactoring of `paragraph_tokenize` and `tokenize`
- N/A
- added `truncate_to` argument (default: 5000) and `minchars` argument (default: 3) to `KeywordExtractor.extract_keywords` method.
- added `score_by` argument to `KeywordExtractor.extract_keywords`. Default is `freqpos`, which means keywords are now ranked by a combination of frequency and position in the document.
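A hedged sketch of the keyword-extraction parameters above, assuming the default `KeywordExtractor` constructor; the sample text and the non-default values shown are illustrative only.

```python
from ktrain.text.kw import KeywordExtractor

kwe = KeywordExtractor()
doc = (
    "ktrain is a lightweight wrapper for TensorFlow Keras that makes it easier "
    "to build, train, and deploy neural networks and other machine learning models."
)
# truncate_to, minchars, and score_by are the parameters described above;
# score_by='freqpos' ranks keywords by frequency and position in the document.
keywords = kwe.extract_keywords(doc, truncate_to=5000, minchars=3, score_by="freqpos")
print(keywords)
```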
- N/A
- N/A
- Allow for returning prediction probabilities when merging tokens in sequence-tagging (PR #445)
- added basic ML pipeline test to workflow using latest TensorFlow
- N/A
- The `text.ner.models.sequence_tagger` function now supports word embeddings from non-BERT transformer models (e.g., `roberta-base`, `openai-gpt`). Thanks to @Niekvdplas.
- Custom tokenization can now be used in sequence-tagging even when using transformer word embeddings. See the `custom_tokenizer` argument to `NERPredictor.predict` (see the sketch after this list).
- [breaking change] In the `text.ner.models.sequence_tagger` function, the `bilstm-bert` model is now called `bilstm-transformer` and the `bert_model` parameter has been renamed to `transformer_model`.
- [breaking change] The `syntok` package is now used as the default tokenizer for `NERPredictor` (sequence-tagging prediction). To use the tokenization scheme from older versions of ktrain, you can import the `re` and `string` packages and supply this function to the `custom_tokenizer` argument: `lambda s: re.compile(f"([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])").sub(r" \1 ", s).split()`.
- Code base was reformatted using black
- ktrain now supports TIKA for text extraction in the `text.textractor.TextExtractor` package with the `use_tika=True` argument as default. To use the old-style text extraction based on the `textract` package, you can supply `use_tika=False` to `TextExtractor`.
- removed warning about sentence pair classification to avoid confusion
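A hedged sketch pulling together the sequence-tagging changes above (the `bilstm-transformer` model, the `transformer_model` parameter, and the `custom_tokenizer` argument). The data-loading step, file paths, and hyperparameters are assumptions, not part of these entries.

```python
import re
import string

import ktrain
from ktrain import text

# Assumed data-loading step: any entities_from_* loader yields (trn, val, preproc);
# the file paths here are placeholders.
(trn, val, preproc) = text.entities_from_conll2003("train.txt", val_filepath="valid.txt")

# 'bilstm-transformer' (formerly 'bilstm-bert') with a non-BERT transformer supplying word embeddings
model = text.sequence_tagger("bilstm-transformer", preproc, transformer_model="roberta-base")
learner = ktrain.get_learner(model, train_data=trn, val_data=val, batch_size=128)
learner.fit(0.01, 1)

# Optionally reproduce the pre-syntok tokenization scheme at prediction time
old_style_tokenizer = lambda s: re.compile(
    f"([{string.punctuation}“”¨«»®´·º½¾¿¡§£₤‘’])"
).sub(r" \1 ", s).split()
predictor = ktrain.get_predictor(learner.model, preproc)
print(predictor.predict("Paul Newman is my favorite actor.", custom_tokenizer=old_style_tokenizer))
```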
- N/A
- ktrain now supports simple, fast, and robust keyphrase extraction with the `ktrain.text.kw.KeywordExtractor` module
- ktrain now only issues a warning if TensorFlow is not installed, instead of halting and preventing further use. This means that pre-trained PyTorch models (e.g., `text.zsl.ZeroShotClassifier`) and sklearn models (e.g., `text.eda.TopicModel`) in ktrain can now be used without having TensorFlow installed.
- `text.qa.SimpleQA` and `text.qa.AnswerExtractor` now both support PyTorch with optional quantization (use `framework='pt'` for the PyTorch version)
- `text.qa.SimpleQA` and `text.qa.AnswerExtractor` now both support a `quantize` argument that can speed up predictions.
- `text.zsl.ZeroShotClassifier`, `text.translation.Translator`, and `text.translation.EnglishTranslator` all support a `quantize` argument.
- pretrained image-captioning and object-detection via `transformers` are now supported
- reorganized imports
- localized seqeval
- The `half` parameter to `text.translation.Translator` and `text.translation.EnglishTranslator` was changed to `quantize` and now supports both CPU and GPU.
- N/A
- `NERPredictor.predict` now includes a `return_offsets` parameter. If True, the results will include character offsets of predicted entities.
- In `eda.TopicModel`, changed `lda_max_iter` to `max_iter` and `nmf_alpha` to `alpha`
- Added `show_counts` parameter to `TopicModel.get_topics` method
- Changed `qa.core._process_question` to `qa.core.process_question`
- In `qa.core`, added `remove_english_stopwords` and `and_np` parameters to `process_question`
- The `valley` learning rate suggestion is now returned in `learner.lr_estimate` and `learner.lr_plot` (when `suggest=True` is supplied to `learner.lr_plot`)
- save `TransformerEmbedding` model, tokenizer, and configuration when saving `NERPredictor` and reset `te_model` to facilitate loading NERPredictors with BERT embeddings offline (#423)
- switched from `keras2onnx` to `tf2onnx`, which supports newer versions of TensorFlow
- N/A
- N/A
- added `get_tokenizer` call to `TransformersPreprocessor._load_pretrained` to address issue #416
- N/A
- pin to `sklearn==0.24.2` due to breaking changes. `eli5` fork for tf.keras updated for 0.24.2. To use `scikit-learn==0.24.2`, users must uninstall and re-install the `eli5` fork with: `pip install https://github.com/amaiya/eli5/archive/refs/heads/tfkeras_0_10_1.zip`
- N/A
- New vision models: added MobileNetV3-Small and EfficientNet. Thanks to @ilos-vigil.
- `core.Learner.plot` now supports plotting of any value that exists in the training `History` object (e.g., `mae` if previously specified as a metric). Thanks to @ilos-vigil.
- added `raw_confidence` parameter to `QA.ask` method to return raw confidence scores. Thanks to @ilos-vigil.
- pin to `transformers==4.10.3` due to Issue #398
- pin to `syntok==1.3.3` due to bug with `syntok==1.4.1` causing paragraph tokenization in `qa` module to break
- properly suppress TF/CUDA warnings by default
- ensure document fed to `keras_bert` tokenizer to avoid this issue
- speech transcription support
- N/A
- N/A
- N/A
- minor fix to installation due to pypi
- N/A
- N/A
- added `extra_requirements` to `setup.py`
- changed imports for summarization, translation, qa, and zsl in notebooks and tests
- N/A
- `text.AnswerExtractor` is a universal information extractor powered by a Question-Answering module and capable of extracting user-specified information from texts.
- `text.TextExtractor` is a text extraction pipeline (e.g., convert PDFs to plain text)
- changed transformers pin to `transformers>=4.0.0,<=4.10.3`
- N/A
- N/A
- N/A
- `SimpleQA` can now load PyTorch question-answering checkpoints
- change API call to support newest `causalnlp`
- N/A
- N/A
- check for `logits` attribute when predicting using `transformers`
- change raised Exception to warning for longer sequence lengths for `transformers`
- N/A
- Added `method` parameter to `tabular.causal_inference_model`.
- N/A
- Added `tabular.causal_inference_model` function for causal inference support.
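A hedged sketch of the causal-inference support noted above. The DataFrame and column names are invented for illustration; the `treatment_col`/`outcome_col` keyword names and the `fit`/`estimate_ate` calls follow the underlying causalnlp API and are assumptions here, not something these entries specify.

```python
import pandas as pd
from ktrain import tabular

# Toy observational dataset (illustrative only; real use requires realistic data)
df = pd.DataFrame({
    "treatment": [0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0],
    "age":       [23, 45, 31, 52, 40, 28, 61, 35, 48, 26, 57, 39],
    "outcome":   [0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0],
})

# method is the parameter added above; other keyword names are assumed
cm = tabular.causal_inference_model(
    df, treatment_col="treatment", outcome_col="outcome", method="t-learner"
).fit()
print(cm.estimate_ate())
```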
- N/A
- N/A
- N/A
- added `query` parameter to `SimpleQA.ask` so that an alternative query can be used to retrieve contexts from corpus
- added `chardet` as dependency for `stellargraph`
- fixed issue with `TopicModel.build` when `threshold=None`
- API documentation index
- Added warning when a TensorFlow version of selected `transformers` model is not available and the PyTorch version is being downloaded and converted instead using `from_pt=True`.
- Fixed `utils.metrics_from_model` to support alternative metrics
- Check for AUC in `ktrain.utils` "inspect" function
- N/A
- `shallownlp.ner.NER.predict` processes lists of sentences in batches, resulting in faster predictions
- `batch_size` argument added to `shallownlp.ner.NER.predict`
- added `verbose` parameter to `ktrain.text.textutils.extract_copy` to optionally see why each skipped document was skipped
- Changed `TextPredictor.save` to save Hugging Face tokenizer files locally to ensure they can be easily reloaded when `text.Transformer` is supplied with a local path.
- For `transformers` models, the `predictor.preproc.model_name` variable is automatically updated to be the new `Predictor` folder to avoid having users manually update `model_name`. Applies when a local path is supplied to `text.Transformer` and the resultant `Predictor` is moved to a new machine.
- N/A
- `NERPredictor.predict` now optionally accepts lists of sentences to make sequence-labeling predictions in batches (as all other `Predictor` instances already do).
- N/A
- N/A
- expose errors from `transformers` in `_load_pretrained`
- Changed `TextPreprocessor.check_trained` to be a warning instead of an Exception
- N/A
- Support for transformers 4.0 and above.
- added `set_tokenizer` to `TransformerPreprocessor`
- show error message when original weights cannot be saved (for `reset_weights` method)
- cast filename to string before concatenating with suffix in `images_from_csv` and `images_from_df` (addresses issue #330)
- resolved import error for `sklearn>=0.24.0`, but `eli5` still requires `sklearn<0.24.0`.
- N/A
- N/A
- fixed problem with `LabelEncoder` not properly being stored when `texts_from_df` is invoked
- refrain from invoking `max` on empty sequence (#307)
- corrected issue with `return_proba=True` in NER predictions (#316)
- N/A
- A `steps_per_epoch` argument has been added to all `*fit*` methods that operate on generators
- Added `get_tokenizer` methods to all instances of `TextPreprocessor`
- propagate custom metrics to model when `distilbert` is chosen in `text_classifier` and `text_regression_model` functions
- pin `scikit-learn` to 0.24.0 due to breaking change
- N/A
- N/A
- Added `custom_objects` argument to `load_predictor` to load models with custom loss functions, etc.
- Fixed bug #286 related to length computation when `use_dynamic_shape=True`
- N/A
- Added `use_dynamic_shape` parameter to `text.preprocessor.hf_convert_examples`, which is set to `True` when running predictions. This reduces the input length when making predictions, if possible.
- Added warnings to some imports in `imports.py` to allow for slightly lighter-weight deployments
- Temporarily pinning to `transformers>=3.1,<4.0` due to breaking changes in v4.0.
- Suppress progress bar in `predictor.predict` for `keras_bert` models
- Fixed typo causing problems when loading predictor for Inception models
- Fixes to address documented/undocumented breaking changes in `transformers>=4.0`. But, temporarily pinning to `transformers>=3.1,<4.0` for backwards compatibility.
- The `SimpleQA.index_from_folder` method now supports text extraction from many file types including PDFs, MS Word documents, and MS PowerPoint files.
- The default in `SimpleQA.index_from_list` and `SimpleQA.index_from_folder` has been changed to `breakup_docs=True`.
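A hedged end-to-end sketch of the QA indexing described above; the index and document folder paths are placeholders, and `breakup_docs=True` is shown explicitly even though it is now the default.

```python
from ktrain.text.qa import SimpleQA

INDEXDIR = "/tmp/myindex"
SimpleQA.initialize_index(INDEXDIR)

# Index a folder of documents (PDFs, MS Word, PowerPoint, plain text, ...)
SimpleQA.index_from_folder("/path/to/document/folder", INDEXDIR, breakup_docs=True)

qa = SimpleQA(INDEXDIR)
answers = qa.ask("What is the capital of France?")
qa.display_answers(answers[:5])
```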
- N/A
- N/A
- `ktrain.text.textutils.extract_copy` now uses `textract` to extract text from many file types (e.g., PDF, DOC, PPT) instead of just PDFs
- N/A
- N/A
- N/A
- Change exception in model ID check in `Translator` to warning to better allow offline language translations
- `Predictor` instances now provide built-in support for exporting to TensorFlow Lite and ONNX.
- N/A
- N/A
- N/A
- Use fast tokenizers for the following Hugging Face transformers models: BERT, DistilBERT, and RoBERTa models. This change affects models created with either `text.Transformer(...)` or `text.text_classifier('distilbert', ...)`. BERT models created with `text_classifier('bert', ...)`, which uses `keras_bert` instead of `transformers`, are not affected by this change.
- N/A
- N/A
- N/A
- Resolved issue in `qa.ask` method occurring with embedding computations when full answer sentences exceed 512 tokens.
- Support for upcoming release of TensorFlow 2.4, such as removal of references to obsolete `multi_gpu_model`
- [breaking change] `TopicModel.get_docs` now returns a list of dicts instead of a list of tuples. Each dict has keys: `text`, `doc_id`, `topic_proba`, `topic_id`.
- added `TopicModel.get_document_topic_distribution`
- added `TopicModel.get_sorted_docs` method to return all documents sorted by relevance to a given `topic_id`
- Changed version check warning in `lr_find` to a raised Exception to avoid confusion when warnings from ktrain are suppressed
- Pass `verbose` parameter to `hf_convert_examples`
- N/A
- changed `qa.core.display_answers` to make URLs open in a new tab
- pin to `seqeval==0.0.19` due to `numpy` version incompatibility with latest TensorFlow and to suppress errors during installation
- N/A
- N/A
- fixed issue with missing noun phrase at end of sentence in `extract_noun_phrases`
- fixed TensorFlow versioning issues with `utils.metrics_from_model`
- added `extract_noun_phrases` to `textutils`
- `SimpleQA.ask` now includes an `include_np` parameter. When True, noun phrases will be used to retrieve documents containing candidate answers.
- N/A
- N/A
- added optional `references` argument to `SimpleQA.index_from_list`
- added `min_words` argument to `SimpleQA.index_from_list` and `SimpleQA.index_from_folder` to prune small documents or paragraphs that are unlikely to include good answers
- `qa.display_answers` now supports hyperlinks for document references
- N/A
- added `breakup_docs` argument to `index_from_list` and `index_from_folder` that potentially speeds up `ask` method substantially
- added `batch_size` argument to `ask` and set default at 8 for faster answer-retrieval
- refactored `QA` and `SimpleQA` for better extensibility
- Ensure `save_path` is correctly processed in `Learner.evaluate`
- N/A
- Changed installation instructions in `README.md` to reflect that using ktrain with TensorFlow 2.1 will require downgrading `transformers` to 3.1.0.
- updated requirements with `keras_bert>=0.86.0` due to TensorFlow 2.3 error with older versions of `keras_bert`
- In `lr_find` and `lr_plot`, check for TF 2.2 or 2.3 and make necessary adjustments due to TF bug 41174.
- fixed typos in `__all__` in `text` and `graph` modules (PR #250)
- fixed Chinese language translation based on name changes of models with `zh` as source language
- N/A
- added `TopicModel.get_word_weights` method to retrieve the word weights for a given topic
- added `return_fig` option to `Learner.lr_plot` and `Learner.plot`, which allows the matplotlib `Figure` to be returned to the user
- N/A
- N/A
- `SUPPRESS_KTRAIN_WARNINGS` environment variable changed to `SUPPRESS_DEP_WARNINGS`
- N/A
- N/A
- added `num_beams` and `early_stopping` arguments to `translate` methods in `translation` module that can be set to improve translation speed
- added `half` parameter to `Translator` constructor
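A hedged sketch of the translation options above. The model name is an example Hugging Face MarianMT checkpoint and the sample sentence is illustrative; neither is prescribed by these entries.

```python
from ktrain import text

# half=True loads the model in half precision (this parameter was later renamed to quantize)
translator = text.Translator(model_name="Helsinki-NLP/opus-mt-de-en", half=True)

# num_beams and early_stopping can be tuned to trade translation quality for speed
print(translator.translate("Guten Morgen. Wie geht es Ihnen?", num_beams=3, early_stopping=True))
```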
- N/A
- Added `translate_sentences` method to `Translator` class that translates a list of sentences, where the list is fed to the model as a single batch
- Removed TensorFlow dependency from `setup.py` to allow users to use ktrain with any version of TensorFlow 2 they choose.
- Added `truncation=True` to tokenization in `summarization` module
- Require `transformers>=3.1.0` due to breaking changes
- `SUPPRESS_TF_WARNINGS` environment variable changed to `SUPPRESS_KTRAIN_WARNINGS`
- Use `prepare_seq2seq_batch` instead of `prepare_translation_batch` in `translation` module due to breaking change in `transformers==3.1.0`
- N/A
- N/A
- Always use `*Auto*` classes to load `transformers` models to prevent loading errors
- N/A
- N/A
- Added missing `torch.no_grad()` scope in `text.translation` and `text.summarization` modules
- added `nli_template` parameter to `ZeroShotClassifier.predict` to allow versatility in the kinds of labels that can be predicted
- efficiency improvements to `ZeroShotClassifier.predict` that allow faster predictions on large sequences of documents and a large number of labels to predict
- added `multilabel` parameter to `ZeroShotClassifier.predict`
- added `labels` parameter to `ZeroShotClassifier.predict`, an alias to the `topic_strings` parameter
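A hedged sketch combining the `ZeroShotClassifier.predict` options above; the document, label strings, and template text are illustrative only.

```python
from ktrain.text.zsl import ZeroShotClassifier

zsl = ZeroShotClassifier()
doc = "I am unhappy with how long the loan approval is taking."

# labels is an alias for topic_strings; nli_template, multilabel, and batch_size
# are the parameters described in the entries above and elsewhere in this changelog
predictions = zsl.predict(
    doc,
    labels=["politics", "finance", "sports"],
    nli_template="This text is about {}.",
    multilabel=True,
    batch_size=8,
)
print(predictions)
```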
- N/A
- Allow variations on `accuracy` metric such as `binary_accuracy` when inspecting model in `is_classifier`
- N/A
- N/A
- In `texts_from_array`, check `class_names` only after preprocessing before printing classification vs. regression status.
- N/A
- N/A
- In `TextPreprocessor` instances, correctly reset `class_names` when targets are in string format.
- N/A
- added `class_weight` parameter to `lr_find` for imbalanced datasets
- removed pins for `cchardet` and `scikit-learn` from `setup.py`
- added version check for `eli5` fork
- removed `scipy` pin from `setup.py`
- Allow TensorFlow 2.3 for Python 3.8
- Request manual installation of `shap` in `TabularPredictor.explain` instead of inclusion in `setup.py`
- N/A
- N/A
- N/A
- include metrics check in `is_classifier` function to support models with non-standard loss functions
- N/A
- N/A
- Ensure transition to `YTransform` is backwards compatible for `StandardTextPreprocessor` and `BertPreprocessor`
- N/A
- `TextPreprocessor` instances now use `YTransform` class to transform targets
- `texts_from_df`, `texts_from_csv`, and `texts_from_array` employ the use of either `YTransformDataFrame` or `YTransform`
- `images_from_df`, `images_from_fname`, `images_from_csv`, and `images_from_array` use `YTransformDataFrame` or `YTransform`
- Extra imports removed from PyTorch-based `zsl.core.ZeroShotClassifier` and `summarization.core.TransformerSummarizer`. If necessary, both can now be used without having TensorFlow installed by installing ktrain using `--no-deps` and importing these modules using a method like this.
- N/A
- N/A
- `NERPredictor.predict` was changed to accept an optional `custom_tokenizer` argument
- N/A
- N/A
- N/A
- added missing `num_classes` argument to `to_categorical`
- N/A
- Adjusted `no_grad` scope in `ZeroShotClassifier.predict`
- N/A
- support for `tabular` data including explainable AI for tabular predictions
- `learner.validate` and `learner.evaluate` now support regression models
- added `restore_weights_only` flag to `lr_find`. When True, only the model weights will be restored after simulating training, not the optimizer weights. In at least a few observed cases, this "warm up" seems to improve performance when actual training begins. Further investigation is needed, so it is False by default.
- N/A
- added `save_path` argument to `Learner.validate` and `Learner.evaluate`. If `print_report=False`, classification report will be saved as CSV to `save_path`.
- Use `torch.no_grad` with `ZeroShotClassifier.predict` to prevent OOM
- Added `max_length` parameter to `ZeroShotClassifier.predict` to prevent errors on long documents
- Added type check to `TransformersPreprocessor.preprocess_train`
- N/A
- N/A
- Changed `qa` module to use 'Auto' when loading `QuestionAnswering` models and tokenizer
- try `from_pt=True` for `qa` module if initial model-loading fails
- use `get_hf_model_name` in `qa` module
- N/A
- N/A
- return gracefully if no documents match question in `qa` module
- tokenize question in `qa` module to ensure all candidate documents are returned
- Added error in `text.preprocessor` when training set has incomplete integer labels
- added `batch_size` argument to `ZeroShotClassifier.predict` that can be increased to speed up predictions. This is especially useful if `len(topic_strings)` is large.
- N/A
- fixed typo in `load_predictor` error message
- N/A
- updated doc comments in core module
- removed unused `nosave` parameter from `reset_weights`
- added warning about obsolete `show_wd` parameter in `print_layers` method
- pin to `scipy==1.4.1` due to TensorFlow requirement
- N/A
- N/A
- Use `tensorflow==2.1.0` if Python 3.6/3.7 and use `tensorflow==2.2.0` only if on Python 3.8 due to TensorFlow v2.2.0 issues
- N/A
- N/A
- Fixes to address changes or issues in TensorFlow 2.2.0:
  - created `metrics_from_model` function due to changes in the way metrics are extracted from compiled model
  - use `loss_fn_from_model` function due to changes in the way loss functions are extracted from compiled model
  - added `**kwargs` to `AdamWeightDecay` based on this issue
  - changed `TransformerTextClassLearner.predict` and `TextPredictor.predict` to deal with tuples being returned by `predict` in TensorFlow 2.2.0
  - changed multilabel test to use loss instead of accuracy due to TF 2.2.0 issue
  - changed `Learner.lr_find` to use `save_model` and `load_model` to restore weights due to this TF issue and added `TransformersPreprocessor.load_model_and_configure_from_data` to support this
- N/A
- N/A
- N/A
- Explicitly supply `truncation='longest_first'` to prevent sentence pair classification from breaking in `transformers==3.0.0`
- Fixed typo in `encode_plus` invocation
- N/A
- N/A
- Explicitly supply `truncation='longest_first'` to prevent sentence pair classification from breaking in `transformers==3.0.0`
- N/A
- N/A
- Changed `setup.py` to open README file using `encoding="utf-8"` to prevent installation problems on Windows machines with `cp1252` encoding
- Added support for Russian in `text.EnglishTranslator`
- N/A
- N/A
- N/A
- N/A
- Properly set device in `text.Translator` and use CUDA when available
- support for language translation using pretrained `MarianMT` models
- added `core.evaluate` as alias to `core.validate`
- `Learner.estimate_lr` method will return numerical estimates of learning rate using two different methods. Should only be called after running `Learner.lr_find`.
- `text.zsl.ZeroShotClassifier` changed to use `AutoModel*` and `AutoTokenizer` in order to load any `mnli` model
- remove external modules from `ktrain.__init__.py` so that they do not appear when pressing TAB in notebook
- added `Transformer.save_tokenizer` and `Transformer.get_tokenizer` methods to facilitate training on machines with no internet
- explicitly call `plt.show()` in `LRFinder.plot_loss` to resolve issues with plot not displaying in certain cases (PR #170)
- suppress warning about text regression when making text regression predictions
- allow `xnli` models for `zsl` module
- added `metrics` parameter to `text.text_classifier` and `text.text_regression_model` functions
- added `metrics` parameter to `Transformer.get_classifier` and `Transformer.get_regression_model` methods
- `metric` parameter in `vision.image_classifier` and `vision.image_regression_model` functions changed to `metrics`
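A hedged sketch of the new `metrics` parameter described above; the model name, toy data, and `maxlen`/`class_names` values are assumptions for illustration.

```python
from ktrain import text

t = text.Transformer("distilbert-base-uncased", maxlen=128, class_names=["neg", "pos"])

# Toy data, for illustration only
x_train = ["great movie", "terrible film", "loved it", "awful"]
y_train = ["pos", "neg", "pos", "neg"]
trn = t.preprocess_train(x_train, y_train)

# metrics can now be supplied explicitly instead of relying on the default
model = t.get_classifier(metrics=["binary_accuracy"])
```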
- N/A
- N/A
- default model for summarization changed to `facebook/bart-large-cnn` due to breaking change in v2.11
- added `device` argument to `TransformerSummarizer` constructor to control PyTorch device
- require `transformers>=2.11.0` due to breaking changes in v2.11 related to `BART` models
- N/A
- N/A
- prevent `transformer` tokenizers from being pickled during `predictor.save`, as it causes problems for some community-uploaded models like `bert-base-japanese-whole-word-masking`.
- support for Zero-Shot Topic Classification via the `text.ZeroShotClassifier`.
- N/A
- N/A
- N/A
- N/A
- Added the `procs`, `limitmb`, and `multisegment` arguments to `index_from_list` and `index_from_folder` methods in `text.SimpleQA` to speed up indexing when necessary. Supplying `multisegment=True` speeds things up significantly, for example. Defaults, however, are the same as before. Users must explicitly change values if desiring a speedup.
- Load `xlm-roberta*` as `jplu/tf-xlm-roberta*` to bypass error from `transformers`
- N/A
- [breaking change] The `multilabel` argument in the `text.Transformer` constructor was moved to `Transformer.get_classifier` and now correctly allows users to forcibly configure the model for a multilabel task regardless of what the data suggests. However, it is recommended to leave this value as `None`.
- The methods `predictor.save`, `ktrain.load_predictor`, `learner.save_model`, and `learner.load_model` all now accept a path to a folder where all files (e.g., model file, `.preproc` file) will be saved. If the path does not exist, it will be created. This should not be a breaking change, as the `load*` methods will still look for files in the old location if the model or predictor was saved using an older version of ktrain.
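A minimal sketch of the folder-based save/load behavior described above, assuming `predictor` is an existing ktrain `Predictor` instance and the path shown is a placeholder.

```python
import ktrain

# All files (model, .preproc, etc.) are written to this folder; it is created if missing.
predictor.save("/tmp/my_predictor")   # `predictor` assumed to exist

# Later (or on another machine), reload everything from the same folder.
reloaded_predictor = ktrain.load_predictor("/tmp/my_predictor")
```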
- N/A
- N/A
- Added `n_samples` argument to `TextPredictor.explain` to address slowness of `explain` on Google Colab
- Lock to version 0.21.3 of `scikit-learn` to ensure old-style explanations are generated from `TextPredictor.explain`
- added missing `import pickle` to ensure saved topic models can be loaded
- N/A
- Changed `Transformer.preprocess*` methods to accept sentence pairs for sentence pair classification
- N/A
- Out-of-the-box support for image regression
- `vision.images_from_df` function to load image data from pandas DataFrames
- references to `fit_generator` and `predict_generator` converted to `fit` and `predict`
- Resolved issue with multilabel detection returning `False` for valid multilabel problems when data is in form of generator
- Added `TFDataset` class for use as a wrapper around arbitrary `tf.data.Dataset` objects for use in ktrain
- Added `NERPreprocessor.preprocess_train_from_conll2003`
- Removed extraneous imports from `text.__init__.py` and `vision.__init__.py`
- `classes` argument in `images_from_array` changed to `class_names`
- ensure NER data is properly prepared in `text.ner.learner.validate`
- fixed typo with `df` reference in `images_from_fname`
- If no validation data is supplied to `images_from_array`, training data is split to generate validation data
- issue warning if Learner cannot save original weights
- `images_from_array` accepts labels in the form of integer class IDs
- fix pandas `SettingWithCopyWarning` from `images_from_csv`
- fixed issue with `return_proba=True` including class labels for multilabel image classification
- resolved issue with class labels not being set correctly in `images_from_array`
- lock to `cchardet==2.1.5` due to this issue
- fixed `y_from_data` from NumpyArrayIterators in image classification
- N/A
- N/A
- fixed issue with MobileNet model due to typo and added MobileNet example notebook
- N/A
- added `merge_tokens` and `return_proba` options to `NERPredictor.predict`
- N/A
- N/A
- added `textutils` to `text` namespace and added note about `sent_tokenize` to sequence-tagging tutorial
- cast dependent variable to `tf.float32` instead of `tf.int64` for text regression problems using `transformers` library
- N/A
- added `suggest` option to `core.Learner.lr_plot` (see the sketch after this list)
- set interactive mode for matplotlib so plots show automatically from Python console and PyCharm
- run prepare for NER sequence predictor to avoid matrix mismatch
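A minimal sketch of the `suggest` option noted above, assuming `learner` is an existing ktrain `Learner` wrapping a compiled model and data.

```python
# Simulate training across a range of learning rates, then plot the loss curve
# with a suggested learning rate highlighted.
learner.lr_find()              # `learner` assumed to exist
learner.lr_plot(suggest=True)
```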
- N/A
- N/A
- ensure `text.eda.TopicModel.visualize_documents` works with `bokeh` v2.0.x
- support for building Question-Answering systems
- `textutils` now contains `paragraph_tokenize` function
- N/A
- resolved import issue with `textutils.sent_tokenize`
- N/A
- `TransformerSummarizer` accepts BART `model_name` as parameter
- N/A
- support for link prediction with graph neural networks
- text summarization with pretrained BART (included in 0.13.1 but not in 0.13.0)
- `bigru` method now selects pretrained word vectors based on detected language
- instead of throwing error, default to English if `detect_lang` could not detect language from batch of texts
- `layers` argument moved to `TransformerEmbedding` constructor
- enforce specific version of TensorFlow due to undocumented breaking changes in newer TF versions
- `AdamWeightDecay` optimizer is now used to support global weight decay. Used when user explicitly sets a weight decay
- force re-instantiation of `TransformerEmbedding` object when `sequence_tagger` function is re-invoked
- Added `max_momentum` and `min_momentum` parameters to `autofit` and `fit_onecycle` to control cyclical momentum
- Prevent loading errors of previously saved NERPreprocessor objects
- N/A
- N/A
- Require that at least TensorFlow 2.1.0 is installed in `setup.py` due to TF 2.0.0 bug with `lr_find`
- Added lower bounds to scikit-learn and networkx versions
- N/A
- N/A
- N/A
- check and ensure AllenNLP is installed when Elmo embeddings are selected for NER
- BERT and Elmo embeddings for NER and other downstream tasks
- `wv_path_or_url` parameter moved from `entities_from*` functions to `sequence_tagger`
- Added `use_char` parameter and ensure it is not used unless `DISABLE_V2_BEHAVIOR` is enabled
- `batch_size` argument added to `get_predictor` and `load_predictor`
- `eval_batch_size` argument added to `get_learner`
- added `val_pct` argument to `entities_from_array`
- properly set threshold in `text.eda` (PR #99)
- fixed error when no validation data is supplied to `entities_from_array`
- N/A
- N/A
- prevent errors with reading word vector files on Windows by specifying `encoding='utf-8'`
- N/A
- N/A
- `ktrain.text.eda.visualize_documents` now properly processes filepath argument
- `entities_from_txt`, `entities_from_gmb`, and `entities_from_conll2003` functions now discover the encoding of the file automatically when `encoding=None` (which is the default now)
- N/A
- N/A
- sequence-tagging (e.g., NER) now supports ELMo embeddings with `use_elmo=True` argument to data-loading functions like `entities_from_array` and `entities_from_txt`
- pretrained word embeddings (i.e., fasttext word2vec embeddings) can be specified by providing the URL to a `.vec.gz` file from here. The URL (or path) is supplied as `wv_path_or_url` argument to data-loading functions like `entities_from_array` and `entities_from_txt`
- `show_random_images`: show random images from folder in Jupyter notebook
- `NERPreprocessor` now includes a `preprocess_test` method for easier evaluation of test sets in datasets that contain a training, validation, and test set
- ensure `DISABLE_V2_BEHAVIOR=True` when `ImagePredictor.explain` is invoked
- added `SUPPRESS_TF_WARNINGS` environment variable. Default is '1'. If set to '0', TF warnings will be displayed.
- `merge_entities` method of `ktrain.text.shallownlp.ner.NER` changed to `merge_tokens`
- moved `load_predictor` to constructor in `ktrain.text.shallownlp.ner.NER`
- `ktrain.text.shallownlp.ner.NER` now supports `predictor_path` argument
- convert `class_names` to strings in `core.validate` to prevent error from scikit-learn
- fixed error arising when no data augmentation scheme is provided to the `images_from*` functions
- fixed bug in `images_from_fname` to ensure supplied `pattern` is used
- added `val_folder` argument to `images_from_fname`
- raise Exception when `preproc` is not found in `load_predictor`
- check for existence of `preproc` in `text_classifier` and `text_regression_model`
- fixed `text.eda` so that `detect_lang` is called correctly after being moved to `textutils`
- N/A
- `shallownlp.Classifier.texts_from_folder` changed to `shallownlp.Classifier.load_texts_from_folder`
- `shallownlp.Classifier.texts_from_csv` changed to `shallownlp.Classifier.load_texts_from_csv`
- In `text.preprocessor`, added warning that `class_names` is being ignored when `class_names` were supplied and `y_train` and `y_test` contain string labels
- N/A
- `Transformer` API in ktrain now supports using community-uploaded transformer models
- added `shallownlp` module with out-of-the-box NER for English, Russian, and Chinese
- `text.eda` module now supports NMF in addition to LDA
- `texts_from_csv` and `texts_from_df` now accept a single column of labels in string format and will 1-hot-encode labels automatically for classification or multi-class classification problems.
- reorganized language-handling to `text.textutils`
- more suppression of warnings due to spurious warnings from TF2 causing confusion in output
- `classes` argument to `Transformer` constructor has been changed to `class_names` for consistency with `texts_from_array`
- N/A
- N/A
- changed pandas dependency to `>=1.0.1` due to bug in pandas 1.0
- N/A
- N/A
- Transformed data containers for transformers, NER, and graph node classification to be instances of `ktrain.data.Dataset`.
- fixed `images_from_array` so that y labels are correctly 1-hot-encoded when necessary
- correct tokenization for `bert-base-japanese` Transformer models from PR 57
- N/A
- Removed Exception when `distilbert` is selected in `text_classifier` for non-English language after Hugging Face fixed the reported bug.
- XLNet models like `xlnet-base-cased` now work after casting input arrays to `int32`
- modified `TextPredictor.explain` to propagate correct error message from `eli5` for multilabel text classification.
- N/A
- N/A
- fixed `utils.nclasses_from_data` for `ktrain.Dataset` instances
- prevent `detect_lang` from failing when Pandas Series is supplied
- support for out-of-the-box text regression in both the `Transformer` API and conventional API (i.e., `text.text_regression_model`).
- `text.TextPreprocessor` prints sequence length statistics
- auto-detect language when using `Transformer` class to prevent printing `en` as default
- N/A
- `MultiArrayDataset` accepts list of Numpy arrays
- fixed incorrect activation in `TextPredictor` for multi-label Transformer models
- fixed `top_losses` for regression tasks
- initial base `ktrain.Dataset` class for use as a Sequence wrapper to better support custom datasets/models
- N/A
- N/A
- N/A
- N/A
- fix to support multilabel text classification in `Transformers`
- `_prepare_dataset` no longer breaks when validation dataset has not been supplied
- availability of a new, simplified interface to Hugging Face transformer models
- added 'distilbert' as an available model in `text.text_classifier` function
- `preproc` argument is required for `text.text_classifier`
- `core._load_model` calls `_make_predict_function` before returning model
- added warning when non-adam optimizer is used with `cycle_momentum=True`
- N/A
- N/A
- Fixed error when using ktrain with v0.2.x of `fastprogress`. ktrain can now be used with both v0.1.x and v0.2.x of `fastprogress`
- All data-loading functions (e.g., `texts_from_csv`) accept a `random_state` argument that will enable consistent reproduction of the train-test split.
- perform local checks for `stellargraph` where needed.
- removed `stellargraph` as dependency due to issues with it overwriting `tensorflow-gpu`
- change `setup.py` to skip navigation links for pypi page
- N/A
- All data-loading functions (e.g., `texts_from_csv`) accept a `random_state` argument that will enable consistent reproduction of the train-test split.
- perform local checks for `stellargraph` where needed.
- removed `stellargraph` as dependency due to issues with it overwriting `tensorflow-gpu`
- N/A
- ktrain now uses tf.keras (`tensorflow>=1.14,<=2.0`) instead of stand-alone Keras.
- N/A
- N/A
- N/A
- added encoding argument when reading in word vectors to bypass error on Windows systems (PR #31)
- Change preprocessing defaults and apply special preprocessing in `text.eda.get_topic_model` when non-English is detected.
- N/A
- N/A
- `TextPredictor.explain` now correctly supports non-English languages.
- Parameter `activation` is no longer ignored in `_build_bert` function
- support for learning from unlabeled or partially-labeled text data
- unsupervised topic modeling with LDA
- one-class text classification to score documents based on similarity to a set of positive examples
- document recommendation engine
- N/A
- Removed dangling reference to external 'stellargraph' dependency from `_load_model`, so that we rely solely on local version of stellargraph
- N/A
- N/A
- Removed dangling reference to external 'stellargraph' dependency so that we rely solely on local version of stellargraph
- N/A
- N/A
- store a local version of `stellargraph` to prevent it from installing `tensorflow-cpu` and overriding existing `tensorflow-gpu` installation
- Support for node classification in graphs with `ktrain.graph` module
- N/A
- N/A
- N/A
- N/A
- Call `reset` before `predict_generator` for consistent ordering of `view_top_losses` results
- Fixed incorrect reference to `train_df` instead of `val_df` in `texts_from_df`
- All `fit` methods in ktrain now accept `class_weight` parameter to handle imbalanced datasets.
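A minimal sketch of the `class_weight` support noted above, assuming `learner` is an existing ktrain `Learner` for an imbalanced binary problem; the weights and learning rate shown are illustrative.

```python
# Give the minority class (label 1 here) five times the weight of the majority class.
learner.autofit(2e-5, 3, class_weight={0: 1.0, 1: 5.0})   # `learner` assumed to exist
```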
- N/A
- Resolved problem with `text_classifier` incorrectly using `uncased_L-12_H-768_A-12` to build BERT model instead of `multi_cased_L-12_H-768_A-12` when non-English language was detected.
- Fixed error messages related to preproc requirement in `text_classifier`
- Fixed test script for multilingual text classification
- Fixed rendering of Chinese in `view_top_losses`
- N/A
- N/A
- Fix problem with `text_classifier` incorrectly using `uncased_L-12_H-768_A-12` to build BERT model instead of `multi_cased_L-12_H-768_A-12` when non-English language was detected.
- Added multilingual support for text classification.
- Added experimental support for tf.keras. By default, ktrain will use standalone Keras. If `os.environ['TF_KERAS']` is set, ktrain will attempt to use tf.keras. Some capabilities (e.g., `predictor.explain` for images) are not yet supported for tf.keras
- When BERT is selected, check to make sure dataset is correctly preprocessed for BERT
- Fixed `utils.bert_data_type` and ensured it does more checks to validate BERT-style data
- N/A
- globally import tensorflow
- suppress tensorflow deprecation warnings from TF 1.14.0
- Resolved issue with
text_classifier
failing when BERT is selected and Preprocessor is supplied.
- Support for sequence tagging with Bidirectional LSTM-CRF. Word embeddings can currently be either random or word2vec (cbow). If the latter is chosen, word vectors will be downloaded automatically from the Facebook fasttext site.
- Added `ktrain.text.texts_from_df` function
- Added FutureWarning in `text.text_classifier` that `preproc` will be a required argument in the future.
- In `text.text_classifier`, when `preproc=None`, use the maximum feature ID to populate max_features.
- Fixed construction of custom_objects dictionary for BERT to ensure load_model works for custom BERT models
- Resolved issue with pretrained bigru models failing when max_features >= total word count.
- `explain` methods have been added to `TextPredictor` and `ImagePredictor` objects.
- `TextPredictor.predict_proba` and `ImagePredictor.predict_proba_*` convenience methods have been added.
- Added `utils.is_classifier` utility function
- `TextPredictor.predict` method can now accept a single document as input instead of always requiring a list.
- Output of `core.view_top_losses` now includes the ground truth label of examples
- Fixed test of data loading
- added additional tests of ktrain
- Added `classes` argument to `vision.images_from_folder`. Only classes/subfolders matching a name in the `classes` list will be considered.
- Resolved issue with using `learner.view_top_losses` with BERT models.
- N/A
- Added `classes` argument to `vision.images_from_folder`. Only classes/subfolders matching a name in the `classes` list will be considered.
- Fixed issue with `learner.validate` and `learner.predict` failing when validation data is in the form of an Iterator (e.g., DirectoryIterator).
- N/A
- Added check in `ktrain.lroptimize.lrfinder` to stop training if learning rate exceeds a fixed maximum, which may happen when a bad/dysfunctional model is supplied to learning rate finder.
- In `ktrain.text.data.texts_from_folder` function, only subfolders specified in classes argument are read in as training and validation data.
- N/A
- N/A
- Fixed error related to `validation_steps=None` in call to `fit_generator` in `ktrain.core` on Google Colab.
- Support for pretrained BERT Text Classification (see the sketch after this list)
- For `Learner.lr_find`, added optional `max_epochs` argument.
- Changed `Learner.confusion_matrix` to `Learner.validate` and added optional `val_data` argument. The `use_valid` argument has been removed.
- Removed `pretrained_fpath` argument to `text.text_classifier`. Pretrained word vectors are now downloaded automatically when 'bigru' is selected as model.
- Further cleanup of `utils.is_iter` function to use type check.
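A hedged sketch of the BERT text-classification workflow introduced above; the dataset path, `classes`, `maxlen`, batch size, and learning rate are illustrative placeholders, not prescribed by these entries.

```python
import ktrain
from ktrain import text

# Load data preprocessed for BERT (assumed folder layout: train/ and test/ subfolders per class)
(x_train, y_train), (x_test, y_test), preproc = text.texts_from_folder(
    "/path/to/dataset", maxlen=350, preprocess_mode="bert", classes=["neg", "pos"]
)

model = text.text_classifier("bert", train_data=(x_train, y_train), preproc=preproc)
learner = ktrain.get_learner(model, train_data=(x_train, y_train),
                             val_data=(x_test, y_test), batch_size=6)
learner.fit_onecycle(2e-5, 1)
```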
- N/A
- For `Learner.lr_find`, removed `epochs` and `max_lr` arguments and added `lr_mult` argument. Default `lr_mult` is 1.01, but can be changed to control size of sample being used to estimate learning rate.
- Changed structure of examples folder
- Resolved issue with `utils.y_from_data` not working correctly with DataFrameIterator objects.
- N/A
- Use class check in `utils.is_iter` as temporary fix
- revert to epochs=5 for `Learner.lr_find`
- N/A
- N/A
- N/A
- `Learner.set_weight_decay` now works correctly
- BIGRU text classifier: Bidirectional GRU using pretrained word embeddings
- Epochs are calculated automatically in `LRFinder`
- Number of epochs that `Learner.lr_find` runs can be explicitly set again
- relocated calls to tensorflow
- installation instructions and reformatted examples
- `cycle_momentum` argument for both `autofit` and `fit_onecycle` methods that will cycle momentum between 0.95 and 0.85 as described in this paper
- `Learner.plot` method that will plot training-validation loss, LR schedule, or momentum schedule
- added `set_weight_decay` and `get_weight_decay` methods to get/set "global" weight decay in Keras
- `vision.data.preview_data_aug` now displays images in rows by default
- added multigpu flag to `core.get_learner` with comment that it is only supported by `vision.model.image_classifier`
- added `he_normal` initialization to FastText model
- Bug in `vision.data.images_from_fname` that prevented relative paths for directory argument
- Bug in `utils.y_from_data` that returned incorrect information for array-based training/validation data
- Bug in `core.autofit` with callback failure when validation data is not set
- Bug in `core.autofit` and `core.fit_onecycle` with learning rate setting at end of cycle
- Last release without CHANGELOG updates