This is a site PreParser that helps you pre-parse data from a specified website URL or API.
It removes the duplicated boilerplate code of requesting the specified URLs and speeds up the process with a threading pool,
so you only need to focus on your business-logic code after you get the response from the specified webpage or API URLs.
The old version 1.0.0 could only pre-parse static HTML or API data,
but since 2.0.0 a new `html_dynamic`
mode has been added, which retrieves all content, even content generated by JS
code.
python version >= 3.9
$ pip install preparser
Github Resource ➡️ Github Repos
Feel free to fork and modify this code. If you like the current project, please star ⭐ it, uwu.
PyPI ➡️ PyPI Publish
Here below are the parameters you can use to initialize a `PreParser` object from the `preparser` package:
Parameter | Type | Description |
---|---|---|
`url_list` | list | The list of URLs to parse. Default is an empty list. |
`request_call_back_func` | Callable or None | A callback function that, depending on `parser_mode`, receives the `BeautifulSoup` object or the request's JSON object. If you want to signal that your business processing failed, return `None`; otherwise return any non-`None` object. |
`parser_mode` | `'html'`, `'api'` or `'html_dynamic'` | The pre-parsing mode; default is `'html'`. `'html'`: parse static HTML content and return a `BeautifulSoup` object. `'api'`: parse data from an API and return the JSON object. `'html_dynamic'`: parse the whole rendered page HTML, including content generated by dynamic JS code, and return a `BeautifulSoup` object. **You receive the object in `request_call_back_func` if you defined it; otherwise get it via `PreParser(...).cached_request_datas`.** |
`cached_data` | bool | Whether to cache the parsed data. Default is `False`. |
`start_threading` | bool | Whether to use a threading pool for parsing the data. Default is `False`. |
`threading_mode` | `'map'` or `'single'` | How tasks are dispatched; default is `'single'`. `'map'`: use the `map` function of the threading pool to distribute tasks. `'single'`: use the `submit` function to distribute tasks one by one into the threading pool. |
`stop_when_task_failed` | bool | Whether to stop when a request to a URL fails. Default is `True`. |
`threading_numbers` | int | The maximum number of threads in the threading pool. Default is `3`. |
`checked_same_site` | bool | Whether to add extra header info to pretend the request comes from the same site, to work around CORS blocking. Default is `True`. |
`html_dynamic_scope` | list or None | Restrict dynamic parsing to a specified DOM scope of the whole page. Default is `None`, which stands for the whole page. If set, this parameter must be a 2-item list. 1. The first value is a tag selector. For example, `'div#main'` matches the `div` tag with `id="main"`, and `'div.test'` matches the first `div` tag with `class="test"`. Don't make the selector too complex or let it match multiple parent DOM nodes, otherwise its `inner_html()` can't be fetched correctly or times out; on success, the `BeautifulSoup` object of the inner HTML of the selected tag is passed to `request_call_back_func`. 2. The second value should be one of: `attached`: wait for the element to be present in the DOM. `detached`: wait for the element to not be present in the DOM. `visible`: wait for the element to have a non-empty bounding box and no `visibility:hidden`. Note that an element without any content or with `display:none` has an empty bounding box and is not considered visible. `hidden`: wait for the element to be detached from the DOM, or have an empty bounding box, or `visibility:hidden`; this is the opposite of the `visible` option. |
`ssl_certi_verified` | bool | Whether to verify the SSL certificate when requesting data from URLs. Default is `True`, meaning the certificate is verified to keep requests safe. |
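The two `threading_mode` values mirror the two standard ways of dispatching work to a thread pool in Python. This is a minimal standard-library sketch of that difference, not preparser's internals; the `fetch` function is a placeholder for a real request:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch(url):
    # placeholder for a real request; just echoes the URL here
    return f"parsed:{url}"

urls = ["https://example.com/1", "https://example.com/2", "https://example.com/3"]

with ThreadPoolExecutor(max_workers=3) as pool:
    # 'map' mode: hand the whole list over at once; results come back in input order
    map_results = list(pool.map(fetch, urls))

with ThreadPoolExecutor(max_workers=3) as pool:
    # 'single' mode: submit tasks one by one; results arrive as each finishes
    futures = [pool.submit(fetch, u) for u in urls]
    single_results = [f.result() for f in as_completed(futures)]

print(sorted(map_results) == sorted(single_results))  # prints True
```

Both modes do the same work; `map` preserves input order while `submit`/`as_completed` yields results as they finish.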
```python
# test.py
from preparser import PreParser, BeautifulSoup, Json_Data, Filer

def handle_preparser_result(url: str, preparser_object: BeautifulSoup | Json_Data) -> bool:
    # write whatever business logic you want here
    # attention:
    # the type of preparser_object depends on the `parser_mode` of the `PreParser`:
    # 'api' : preparser_object is a Json_Data
    # 'html' : preparser_object is a BeautifulSoup
    ...
    # for the final return:
    # if you want to signal that the current result failed, return None;
    # otherwise return any object that is not None.
    return preparser_object

if __name__ == "__main__":
    # start the parser
    url_list = [
        'https://example.com/api/1',
        'https://example.com/api/2',
        # .....
    ]
    parser = PreParser(
        url_list=url_list,
        request_call_back_func=handle_preparser_result,
        parser_mode='api',  # depends on what you set: 'api', 'html' or 'html_dynamic'
        start_threading=True,
        threading_mode='single',
        cached_data=True,
        stop_when_task_failed=False,
        threading_numbers=3,
        checked_same_site=True
    )
    # start parsing
    parser.start_parse()
    # when all tasks have finished, you can get all the task results like below:
    all_result = parser.cached_request_datas
    # if you want to terminate, just call the function below
    # parser.stop_parse()
    # you can also use the Filer to save the final result above
    # and then find the data in `result/test.json`
    filer = Filer('json')
    filer.write_data_into_file('result/test', [all_result])
```
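If you prefer not to use `Filer`, the final save step can also be done with the standard library. This sketch mimics what `Filer('json')` presumably does; the `write_json` helper and the assumption that the `.json` extension is appended to the path stem are illustrative, not part of preparser's API:

```python
import json
from pathlib import Path

def write_json(path_stem, data):
    # hypothetical stand-in for Filer('json').write_data_into_file:
    # append the .json extension (assumed behavior) and dump the data
    path = Path(path_stem + ".json")
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(data, indent=2), encoding="utf-8")
    return path

saved = write_json("result/test", [{"https://example.com/api/1": {"ok": True}}])
```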
Get help ➡️ Github issue
- version 2.0.8: add the function `read_datas_from_file` to `Filer`, to help read data from the specified file types.
- version 2.0.7: add the `ssl_certi_verified` parameter to control whether to ignore errors caused by SSL certificate verification when requesting.
- version 2.0.6: add the `html_dynamic_scope` parameter so users can specify the dynamic parsing scope, which speeds up pre-parsing when `parser_mode` is `html_dynamic`; also re-sort the additional tools into the `ToolsHelper` package.
- version 2.0.5: move the dynamic-mode browser core install from setup into the package call.
- version 2.0.4: test the installation process command.
- version 2.0.3: optimise the error alert for `html_dynamic`.
- version 2.0.2: correct the README doc of `parser_mode`.
- version 2.0.1: update the README doc.
- version 2.0.0: add the new `parser_mode` `html_dynamic`, which helps pre-parse all of the HTML content, even content generated by JS code.
- version 1.0.0: basic version; only pre-parses static HTML and API content.