This repository has been archived by the owner on Jun 28, 2024. It is now read-only.

Have any "successor libraries" emerged, as Cam suggested? #414

Open
shgidi opened this issue Feb 24, 2021 · 20 comments

Comments

@shgidi

shgidi commented Feb 24, 2021

No description provided.

@ColtAllen

ColtAllen commented Jan 31, 2022

UPDATE: pymc-marketing will become the new successor to this library.

I know this post is nearly a year old, but I would be happy to collaborate with others on a successor library built in PyMC.

I've recently started working on a CLV project and already foresee the time-based splitting of calibration and holdout data as a considerable limitation. Random and/or stratified sampling to ensure calibration and holdout data are equally distributed would be my priority, but the built-in statistical functions of PyMC would lend themselves well to this project, and model training can be distributed across GPUs and dramatically reduce training time.
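To make the limitation concrete, here is a minimal sketch contrasting a time-based split with a customer-level random split. It is only one reading of "random sampling" in this context; the transaction rows and the function names `split_by_time` and `split_by_customer` are hypothetical and do not come from lifetimes.

```python
import random
from datetime import date

# Hypothetical transaction log: (customer_id, purchase_date).
transactions = [
    ("a", date(2021, 1, 5)),
    ("a", date(2021, 3, 9)),
    ("b", date(2021, 2, 1)),
    ("c", date(2021, 2, 20)),
    ("c", date(2021, 4, 2)),
    ("d", date(2021, 4, 15)),
]


def split_by_time(rows, calibration_end):
    """Time-based split: rows up to the cutoff calibrate, later rows are held out."""
    calibration = [r for r in rows if r[1] <= calibration_end]
    holdout = [r for r in rows if r[1] > calibration_end]
    return calibration, holdout


def split_by_customer(rows, holdout_fraction, seed=0):
    """Random split: each customer's full history lands on one side only."""
    customers = sorted({customer for customer, _ in rows})
    random.Random(seed).shuffle(customers)
    n_holdout = round(len(customers) * holdout_fraction)
    holdout_ids = set(customers[:n_holdout])
    calibration = [r for r in rows if r[0] not in holdout_ids]
    holdout = [r for r in rows if r[0] in holdout_ids]
    return calibration, holdout


time_cal, time_hold = split_by_time(transactions, date(2021, 3, 1))
cust_cal, cust_hold = split_by_customer(transactions, holdout_fraction=0.5)
```

With the time-based split, customer "d" appears only in the holdout period, so the two sets need not be comparable. The customer-level split keeps both sets on the same calendar window, at the cost of no longer testing forward-in-time prediction.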

I'm still proceeding with lifetimes as-is for the beta release of my CLV project, so I won't have much time to dedicate to a successor library until Mar 2022, but if anyone is interested, please respond to this issue.

CamDavidsonPilon pinned this issue Jan 31, 2022
@shgidi
Author

shgidi commented Feb 7, 2022

@ColtAllen feel free to contact me

@gpyga

gpyga commented Feb 7, 2022

@ColtAllen, I am personally more interested in a TensorFlow Probability-based successor, having not worked with Pyro much, but I would be interested in assisting and seeing where there may be overlap.

@rodriveracom

I have been working, albeit slowly, on building a successor on Dask instead of Pandas. I see the challenge of doing CLV on millions of users and not being able to fit things in memory. The idea of Pyro sounds very compelling. How would you like to organize the project?

@ColtAllen

ColtAllen commented Feb 27, 2022

@shgidi @gpyga @RodrigoRivera Want to plan a Zoom call to discuss this further? I’m in the Denver area, Mountain Standard Time (UTC-7:00). I have a draft prepared of the details I want to discuss, but I’ll provide an overview here and address your comments.

> @ColtAllen, I am personally more interested in a TensorFlow Probability-based successor, having not worked with Pyro much, but I would be interested in assisting and seeing where there may be overlap.

Pyro is to PyTorch what TFProb is to TensorFlow. If this project takes off, then support for both libraries would be a great direction to go. I personally prefer Pyro because open-source is only as good as the supporting documentation. I started working with TFProb back in 2017 when it was still called Edward, but have since moved away from it because the vague yet verbose documentation, which even has a few broken links, created considerable friction in my projects:

https://www.tensorflow.org/probability/overview

The documentation for Pyro on the other hand is among the best I’ve ever seen for an open-source library:

https://docs.pyro.ai/en/stable/

Both packages are also relatively low-level. Base TF can be cumbersome to work with, whereas PyTorch was expressly written to have a syntax similar to NumPy:

https://pytorch.org/tutorials/beginner/basics/tensorqs_tutorial.html

Speaking of numpy:

https://examples.dask.org/array.html

> I have been working, albeit slowly, on building a successor on Dask instead of Pandas. I see the challenge of doing CLV on millions of users and not being able to fit things in memory.

Dask is basically a distributed drop-in replacement for numpy and would be an excellent alternative for the RFM aggregations. My current project has over 88 million transactions, so my team had to create a separate RFM feature store just to use lifetimes.
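For readers unfamiliar with the RFM aggregation being discussed, here is a minimal single-machine sketch of the per-customer group-by that a Dask version would distribute. It assumes daily periods and the frequency/recency/T convention common to BG/NBD-style models (frequency counts repeat purchase days only); the function name `rfm_summary` and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date


def rfm_summary(transactions, observation_end):
    """Aggregate (customer_id, purchase_date) rows into per-customer values in days."""
    purchase_days = defaultdict(set)
    for customer_id, purchase_date in transactions:
        if purchase_date <= observation_end:
            # A set collapses several purchases on the same day into one period.
            purchase_days[customer_id].add(purchase_date)

    summary = {}
    for customer_id, days in purchase_days.items():
        first, last = min(days), max(days)
        summary[customer_id] = {
            "frequency": len(days) - 1,  # repeat purchase days only
            "recency": (last - first).days,  # age at last purchase
            "T": (observation_end - first).days,  # age at end of observation
        }
    return summary


# Hypothetical sample data.
rows = [
    ("a", date(2021, 1, 1)),
    ("a", date(2021, 1, 1)),
    ("a", date(2021, 1, 11)),
    ("a", date(2021, 1, 31)),
    ("b", date(2021, 2, 10)),
]
summary = rfm_summary(rows, observation_end=date(2021, 3, 2))
```

Because each customer is aggregated independently, the work partitions naturally by customer ID, which is what makes it a reasonable candidate for a distributed group-by.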

> The idea of Pyro sounds very compelling. How would you like to organize the project?

In the Zoom call, I want to address and attain common agreement in the following areas:

  • Problems
  • Goals
  • Contributing

I’ve reviewed the GitHub issues for lifetimes in detail, and we each have our own lists of problems to bring up I’m sure, but let’s not confuse issues with features we’d like to see added.

I like the OKR approach for setting goals (qualitative Objectives and measurable Key Results) but I’m not married to the methodology by any means. A good objective would be to make lifetimes the premier open-source library for stochastic RFM and CLV modeling. The number of models supported, reducing training times and the rate of convergence errors, and increasing the number of GitHub Stars and Watches are all ways we can measure this.

Lastly, the documentation for lifetimes is quite good, but I want to review the contributor’s guide in particular, make any desired changes, and ensure we’re all in alignment before going full-speed ahead with code development, because it will make PRs go much more smoothly in the future.

After these preliminaries are out of the way, we can put a task list together and set up GitHub Project pages for each. Looking forward to working with you all!

@deepyaman

@ColtAllen I would also be interested in collaborating on a successor library, and would love to join an upcoming call (if the kickoff you mentioned hasn't happened yet)!

We use lifetimes in our CVM toolkit at my current company, but I was looking into how we may have access to a wider variety of methods than it currently implements (was looking at the R libraries like btydplus and CLVTools for inspiration). No strong opinion like the rest of you on backend thus far, although I have slightly more exposure to Dask than the other alternatives.

@rodriveracom

> @shgidi @gpyga @RodrigoRivera Want to plan a Zoom call to discuss this further? I’m in the Denver area, Mountain Standard Time (UTC-7:00). I have a draft prepared of the details I want to discuss, but I’ll provide an overview here and address your comments.

Absolutely. I am in Central European Time. Should we aim to have a call in the second or third week of March?

@ColtAllen

@RodrigoRivera Awesome! How about either March 13th or 20th for the Zoom meeting? Due to time zone differences, I see this happening around noontime for those in the Americas, and in the evening for those in Europe.

@deepyaman Hope you can join! I've been looking at the btydplus and CLVTools R libraries as well, and am even considering rpy2 (a Python API for R) as a band-aid for the MLE convergence issues I've been encountering in lifetimes so far.

@deepyaman

March 13 works for me personally!

@ColtAllen I’ve used rpy2 in the past to use some epidemiological modeling package that—at least at the time—had no reasonable Python equivalent. My intuition is to steer clear of it for a successor to lifetimes, since requiring an R runtime for a Python package ends up being very inconvenient/limiting from a production deployment perspective (suddenly all the Docker images need to have R installed, etc.).

@ColtAllen

ColtAllen commented Mar 5, 2022

@deepyaman Great! I'll let @RodrigoRivera pick the time since this will be happening at the very end of his day, and I'll post the Zoom link here for anyone to join.

Also, I have little interest in integrating rpy2 into lifetimes; sorry for not clarifying that earlier. My director has R experience and floated the idea for our internal project deployment, but that's an excellent point you make about the added Dockerfile complexity. I'll be sure to bring it up.

If I had to pick another language to incorporate into lifetimes, it would be Stan, which prophet uses under the hood for MCMC inference of the hyperparameters:

https://github.com/facebook/prophet/blob/main/python/stan/unix/prophet.stan

@ColtAllen

@deepyaman @RodrigoRivera @gpyga @shgidi I’m pushing back this Zoom call because I’ve sent collaboration invites to others and want to give them the opportunity to join as well. If I don’t hear from any of them by St. Patrick’s Day, we can go forward with meeting on 20-Mar or any other Sunday you prefer.

I’ve been reviewing the choices of backend for a successor library, and now believe pymc3 and/or Stan are the best options. I’ve found code implementations for the BG/NBD and Gamma-Gamma models in pymc3 and Stan, respectively, and have sent collaboration invites to the creators.

pymc3 has the cleanest, most Pythonic syntax of any statistical library I’ve worked with, but I stopped using it several years ago because it still used the deprecated theano tensor library as a backend. However, aesara - the successor backend they’ve developed - seems quite mature now, and both aesara & pymc3 have huge developer communities to reach out to for support. @CamDavidsonPilon himself has even written an eBook about pymc3; I do hope he’s able to join the Zoom call and/or assist in a technical advisory capacity.

Lastly, I've forked this repo and have invited you all to be collaborators:

https://github.com/ColtAllen/lifetimes

I haven’t done much yet aside from update the README, but I’ll be adding some new research paper links and making other minor documentation changes here shortly.

@CamDavidsonPilon
Owner

I appreciate the invitation to join the call and provide advice, but I don't think I would add much! I would like to express my excitement about a successor library being built with probabilistic programming tools - that was a future vision of mine for these RFM techniques. Best of luck, folks!

@ColtAllen

ColtAllen commented Mar 20, 2022

Zoom call is scheduled for Sunday, 27-Mar at 10 AM Mountain Daylight Time (GMT-6:00)

I've been receiving messages from other interested parties on LinkedIn, so I'm delaying the Zoom call by one more week to give others the chance to discover this discussion and join.

I've already started working on an MCMC implementation of the Beta-Geo model. MCMC has challenges of its own, but according to this paper it has far fewer convergence issues than the current MLE approach, which will solve a lot of problems people have with using this library:

Worth the effort? Comparison of different MCMC algorithms for estimating the Pareto/NBD model

Join Zoom Meeting
https://us02web.zoom.us/j/81938221716

Meeting ID: 819 3822 1716
One tap mobile
+12532158782,,81938221716# US (Tacoma)
+13462487799,,81938221716# US (Houston)

Dial by your location
+1 253 215 8782 US (Tacoma)
+1 346 248 7799 US (Houston)
+1 669 900 6833 US (San Jose)
+1 301 715 8592 US (Washington DC)
+1 312 626 6799 US (Chicago)
+1 929 436 2866 US (New York)
Meeting ID: 819 3822 1716
Find your local number: https://us02web.zoom.us/u/kCp1rZoUe

@ColtAllen

Thanks @deepyaman, @juanitorduz, and everyone else for attending the Zoom call today. Here's a summary of what we discussed:

Identified Library Issues

  • Python syntax does not include type hinting.
  • Instability with scipy.special.hyp2f1 when using pandas inputs, particularly with GammaGammaFitter
  • Lack of options for plotting & quantifying uncertainty.
  • No standard error estimation of Hessian matrix during inference/optimization.
  • RFM aggregations computationally prohibitive with large datasets.
  • Plotting functions have extraneous dependencies on other methods in the library, limiting flexibility.
  • Difficult to determine whether calibration and holdout datasets are equally distributed, since they can only be split by time period; this is also an incomplete approach to model evaluation.
  • MLE convergence not stable:
    • autograd dependency deprecated two years ago.
    • Current log-likelihood formulations can cause optimizers to crash.
    • Current MLE penalizer assumptions ill-suited for parameter estimation.
    • Model assumptions not being tested.
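The summary does not say which likelihood terms cause the crashes. One common cause in likelihoods that sum several exponentiated terms is floating-point overflow, so the sketch below illustrates that failure mode as an assumption, not as a diagnosis of lifetimes itself. The function names are hypothetical.

```python
import math


def naive_log_sum_exp(log_terms):
    """Direct formula: overflows as soon as one term exceeds roughly 709."""
    return math.log(sum(math.exp(t) for t in log_terms))


def stable_log_sum_exp(log_terms):
    """Factor out the largest term so every exponent is <= 0."""
    largest = max(log_terms)
    return largest + math.log(sum(math.exp(t - largest) for t in log_terms))


log_terms = [1000.0, 999.0]

try:
    naive_result = naive_log_sum_exp(log_terms)
except OverflowError:
    # math.exp(1000.0) is outside the range of a double.
    naive_result = None

stable_result = stable_log_sum_exp(log_terms)
```

An optimizer that steps into a region where the naive form overflows receives an exception or a non-finite value, whereas the rearranged form returns a usable number for the same inputs.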

Development Priorities

  1. Update documentation to add the contents of this message, updated contributor's guide, and links to research papers.
  2. Coveralls integration for testing coverage (I'm working on this now.)
  3. Merge PR of BetaGeo Time-invariant Covariates model submitted by @meremeev.
  4. Add type hinting, and separate utility method dependencies from plotting methods.
  5. pymc backend integration into BaseFitter class. Current MLE approach will be replaced with 'find_MAP' function in pymc4, which is expected to be released Apr/May 2022.
  6. Expand model evaluations with the Gelman-Rubin statistic, posterior predictive checks, and other methods.
  7. As the pymc4 overhaul is ongoing, be mindful of existing issues that have been identified, like log-likelihood formulations and scipy.special.hyp2f1 contributing to convergence instability. Add bug fixes whenever these problems arise.
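For reference, the Gelman-Rubin statistic mentioned in item 6 compares between-chain and within-chain variance across several MCMC chains. The following is a minimal sketch of the classic form without later refinements such as split chains or rank normalization. The function name `gelman_rubin` and the sample draws are hypothetical, and a real project would more likely rely on an existing diagnostics library.

```python
from statistics import fmean, variance


def gelman_rubin(chains):
    """Classic R-hat for one parameter, given equal-length chains of draws."""
    m = len(chains)
    n = len(chains[0])
    if m < 2 or n < 2 or any(len(chain) != n for chain in chains):
        raise ValueError("need at least two chains of equal length >= 2")

    chain_means = [fmean(chain) for chain in chains]
    grand_mean = fmean(chain_means)

    # Between-chain variance B and mean within-chain variance W.
    between = n / (m - 1) * sum((mu - grand_mean) ** 2 for mu in chain_means)
    within = fmean([variance(chain) for chain in chains])

    # Pooled estimate of the posterior variance, then the ratio to W.
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


# Hypothetical draws: two chains that agree, and two that do not.
r_hat_mixed = gelman_rubin([[0.9, 1.1, 1.0, 1.2], [1.0, 0.8, 1.1, 1.0]])
r_hat_stuck = gelman_rubin([[0.9, 1.1, 1.0, 1.2], [3.0, 2.8, 3.1, 3.0]])
```

Values close to 1 suggest the chains are sampling the same distribution, while values well above 1 indicate they have not mixed. This sketch does not guard against a within-chain variance of zero.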

Future Additions

  • Support for Hierarchical Bayesian models
  • Additional models
  • Poetry for model packaging
  • Nox for version testing (cannot currently be used with Poetry, so the conflict between the two must be resolved)
  • Distribute RFM aggregation with dask
  • Stan backend integration (this will add considerable overhead to the project. If I don't get a lot of requests and PRs related to Stan I will not be pursuing this)

Future work will continue in the fork I've created:
https://github.com/ColtAllen/lifetimes

@ColtAllen

ColtAllen commented Jun 20, 2022

An alpha release of the successor library - rebranded as btyd - is now available for pip install:
https://github.com/ColtAllen/btyd

@ColtAllen

ColtAllen commented Jul 29, 2022

The btyd successor library is now in Beta:
https://github.com/ColtAllen/btyd

@ColtAllen

Second beta release of the btyd successor library is now available for pip install:
https://github.com/ColtAllen/btyd

@ColtAllen

Third beta release of btyd is now available for pip installation! This one includes a Bayesian variant of the Modified BG/NBD model, a few bug fixes, and some requested additions to the existing lifetimes models.

@ColtAllen

I've decided to merge efforts with the PyMC Labs team and work on the pymc-marketing project, which will become the premier solution for CLV modeling going forward. BTYD has been a solo project of mine ever since I forked this library, but this is now a community effort!

@CamDavidsonPilon , please update the README to reflect this, thank you.

@CamDavidsonPilon
Owner

Neat! Looks like a fun project!

6 participants