Add docs #328
base: master
Conversation
The docs currently reside in index. Most of the content is the result of sphinx-quickstart; I will move it later. The docstrings are still in a very early phase. I wasn't experienced enough with the library to write thorough docstrings.
also move/restructure the rst
Thank you for your attempt at this! The lack of documentation is definitely a major issue, so I do appreciate this!

One thing that immediately sticks out for me is something I'm deeply allergic to, though I'm not sure I spelt it out anywhere before, so apologies for that: duplication of information. It makes future maintenance harder because one has to remember to update things in multiple places. This is inevitably forgotten eventually, which then leads to inconsistencies and confusion. There are three primary examples here:
It probably makes sense to split the documentation writing into three parts: type hints and docstrings for all public API, separate documentation (like a general introduction, CLI vs Python layer, and the Twitter example), and everything directly related to Sphinx (configuration etc.). Most of my points above fall into the last part. Perhaps you'd like to focus on just one for now?

A couple of comments on style. For docstrings, it should be this, in line with the few existing ones:

def func(foo: str) -> str:
	'''Basic description

	Args:
		foo: a description of foo's significance

	Returns:
		a description of the return value
	'''

	return foo

That is: single quotes, no empty line at the end of the docstring, and an empty line between the docstring and the code. Further, I think the class description should go into a class-level docstring, and only

For the rST files, the same style as for my Python code should be used: tabs for indentation, spaces for alignment. Lines should be broken only where it makes sense (i.e. one paragraph = one line); I'm not a fan of breaking text into lines at random points based on 1980s-era screen widths. :-)

That's all I can think of right now. Looking forward to seeing where this is heading! :-)
And @TheTechRobo's comment is correct. #6 is about high-level documentation of the existing scrapers, what they target, returned item fields, etc., while #7 is about the usage of snscrape from Python. There's some overlap though, naturally.
as noted in PR thread JustAnotherArchivist#328
Thank you for your feedback. This is one of my first attempts to contribute to a project that's not my own, so I understand that I need to adjust my style accordingly. I also agree with the points you brought up about avoiding duplication of information; I dislike unnecessary duplication myself. It's just that I wasn't (and still am not) experienced with setting up Sphinx, and I think that showed in my previous attempts. I need to look further and try to configure Sphinx better. I do have some questions, though:
class InstagramUserScraper(InstagramCommonScraper):
	...

	def __init__(self, name, **kwargs):
		super().__init__(mode="User", **kwargs)
		...
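For instance, would something along these lines be the right direction, with the description living in the class-level docstring? The parent class here is just a stand-in so the snippet runs on its own, and the docstring wording is made up.

class InstagramCommonScraper:  # stand-in for the real base class, only so this example is self-contained
	def __init__(self, mode, **kwargs):
		self._mode = mode


class InstagramUserScraper(InstagramCommonScraper):
	'''Scrapes the posts of a single Instagram user.

	Args:
		name: the username of the profile to scrape
	'''

	def __init__(self, name, **kwargs):
		super().__init__(mode="User", **kwargs)
		self._name = name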
Yeah, I have little experience with Sphinx myself as well. Incidentally, that's part of why I put off working on the docs for such a long time. But I'm sure we can find a good solution. Good points on the Instagram and Reddit scrapers. Yes, both of those are a mess and should be refactored. I'll look into that. On that note, I should also revisit the public vs private parts of the code. For example, I don't consider
Add initial edits into conf.py. I'm trying things out right now.
copy-pasted from https://stackoverflow.com/a/62613202
The previous commit (the template copy-pasted from SO) has the following behavior when tested on my local device:
1. It generated all necessary stub .rst files under /_autosummary. (good)
2. It generated HTML for every module, submodule, and class. (good)
3. (the bug) The snscrape.modules page lists -- but does not link to any of -- its submodules. The individual HTML pages for each submodule exist; they are just not linked in the toctree. I can open each resulting HTML file manually in my browser.
I did some trial and error; this commit reflects the minimal change that makes the snscrape.modules page link to its submodules. At least it works.
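Roughly, the setup from that answer looks like the following in conf.py. This is a simplified sketch rather than the exact contents of my configuration; the extension list and paths are assumptions.

# docs/conf.py -- simplified sketch, not the exact configuration in this PR
import os
import sys
sys.path.insert(0, os.path.abspath('..'))  # make the snscrape package importable

project = 'snscrape'

extensions = [
	'sphinx.ext.autodoc',      # pull API documentation from docstrings
	'sphinx.ext.autosummary',  # generate the per-module stub pages
	'sphinx.ext.napoleon',     # parse Google-style Args:/Returns:/Yields: sections
]

# Generate the stub .rst files under _autosummary at build time.
autosummary_generate = True

# Custom templates live here, e.g. a module template whose autosummary
# directive uses :recursive: so submodules get their own stub pages.
templates_path = ['_templates']

The root index.rst then points an .. autosummary:: directive with :toctree: _autosummary and :recursive: at the top-level snscrape package; the linking problem described above comes down to what the generated module pages put into their own toctrees.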
I have questions:
Just something that confused me for a second
The Yields section belongs to get_items instead of __init__.
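Something like this, i.e. an entirely made-up scraper just to show where the section goes:

class ExampleScraper:  # made-up class purely for illustrating the Yields placement
	'''Scrapes items from a hypothetical network.'''

	def __init__(self, name):
		self._name = name

	def get_items(self):
		'''Fetch the items.

		Yields:
			the scraped items, one at a time
		'''

		for i in range(3):  # stand-in for real network requests
			yield f'{self._name} item {i}'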
A quick skim found this
I'm getting this:
Oh, I think it's because I'm on the dev version. You should add a check for that.
It does not happen on my end. The following is the snscrape version:
>>> from importlib.metadata import metadata
>>> M = metadata('snscrape')
>>> M['version']
'0.3.5.dev231+g0832e95'
I'm not sure why my version is still at 0.3.5.dev ... surely it should be at 0.4.x by now? I've merged master to my branch every now and then, too. Is this okay? I installed
@lahdjirayhan The version is only updated when you run
Also, apologies, I didn't notice that you marked this as ready for review. I'll take a look soon!
I really don't see anything other than these minor style questions. This is great!
'''
Args:
To match up with the rest of the docstrings, shouldn't there not be a newline there? (Very minor, I know, but...)
Args, Kwargs, Returns, Yields, etc. are usually positioned unindented at the left according to Google style. In addition to that, I feel uncomfortable putting Args: right after ''' because that first line is usually reserved for the docstring summary.
On the other hand, I don't think the __init__ docstring needs more explanation. The relevant information for each class is already given in the class docstring.
'''
Args:
Same question here.
@JustAnotherArchivist Thanks for the info on the version. I'll try it out as soon as I can. Update: Nope, I've tried rerunning
A very quick skim leaves me saying LGTM.
@lahdjirayhan I'm not sure that merging the branch in also gets the tags. That could be the problem.
#. **Instantiate a scraper object.**
	``snscrape`` provides various object classes that implement their own specific ways. For example, :class:`TwitterSearchScraper` gathers tweets via search query, and :class:`TwitterUserScraper` gathers tweets from a specified user.
#. **Call the scraper's** ``get_item()`` **method.**
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
get_item()?
``get_item()`` is an iterator and yields one item at a time.
Same here
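For what it's worth, the flow those quoted steps describe boils down to something like this; the query string and the printed fields are made up for the example, and note that the method is get_items() (plural) in the codebase, which is what the comments above are getting at:

import snscrape.modules.twitter as sntwitter

# Scrape a handful of tweets matching a search query (query text is illustrative).
scraper = sntwitter.TwitterSearchScraper('from:jack since:2021-01-01')
for i, tweet in enumerate(scraper.get_items()):
	if i >= 5:  # only look at the first few items
		break
	print(tweet.date, tweet.content)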
I think I can try writing some docs (docstrings, examples). I may not be able to document all scrapers and all the machinery in it at the moment, just the ones I've used or can fairly understand. That being said, I think having some sort of documentation on this library is still a good thing to do.
Should I continue working on this, @JustAnotherArchivist? I apologize in advance if this PR feels like it comes out of nowhere.
Sidenote: I'm not sure about the difference between #6 and #7, and which of them this PR is solving.