This is the project repository for CMU 10-615 Art and Machine Learning's final project. Given a speech as input, we generate a film that consists of music, lyrics, and images. Our final outputs can be found under the /video directory.
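At a high level, the pipeline turns the input speech into lyrics, music, and images, and then assembles those pieces into the final film. The sketch below is a minimal illustration of that data flow only; every function in it is a hypothetical stub, and none of these names come from this repository's actual code.

```python
# Purely illustrative sketch of the speech-to-film data flow described above.
# Every function is a hypothetical stub, not this repository's actual code.

def transcribe_speech(audio_path: str) -> str:
    """Stub: a real pipeline would run speech recognition on the input."""
    return "a transcript of the input speech"

def generate_lyrics(transcript: str) -> list[str]:
    """Stub: a real pipeline would generate lyrics conditioned on the transcript."""
    return [f"a lyric line inspired by: {transcript}"]

def generate_music(lyrics: list[str]) -> bytes:
    """Stub: a real pipeline would synthesize a soundtrack for the lyrics."""
    return b"soundtrack-audio-bytes"

def generate_images(lyrics: list[str]) -> list[bytes]:
    """Stub: a real pipeline would render one image per lyric line."""
    return [b"frame-image-bytes" for _ in lyrics]

def assemble_film(frames: list[bytes], soundtrack: bytes, out_path: str) -> None:
    """Stub: a real pipeline would mux the frames and audio into a video file."""
    print(f"would write {len(frames)} frame(s) plus a soundtrack to {out_path}")

if __name__ == "__main__":
    transcript = transcribe_speech("speech.wav")
    lyrics = generate_lyrics(transcript)
    assemble_film(generate_images(lyrics), generate_music(lyrics), "video/output.mp4")
```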
- Zhouyao Xie: School of Computer Science, Language Technology Institute, Master of Computational Data Science
- Nikhil Yadala: School of Computer Science, Language Technology Institute, Master of Computational Data Science
- Yifan He: College of Fine Arts, School of Music, Music and Technology
- Guannan Tang: College of Engineering, Materials Science Department
Our report is included in this repository (see report.pdf). It can also be accessed via this link.
Our presentation slides have also been uploaded to this repo (see presentation.pdf). They can also be found here.