Common core for site parsing with the Python Grab framework.
- Install Python 3.9
wget -O python.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash python.sh -b
rm python.sh
export PATH=~/miniconda3/bin:$PATH
- Install pipenv (`pip install pipenv`)
- Clone project
- In project directory
pipenv install
- [Optional for Windows] Download and install curl
- Run
pipenv shell
- Run
python main.py {SITE_CONFIG_FILE_NAME}
All values must be strings:
- `APP_WORK_MODE`: `dev` sets DEBUG mode for all loggers; `info` is also supported; any other value sets `error`
- `APP_CAN_OUTPUT`: `True` allows the app to `print(...)` some important messages
- `APP_LOG_FORMAT`: log format (in Python logger format)
- `APP_LOG_DIR`: log directory name
- `APP_LOG_DEBUG_FILE`: log file name (own code output only)
- `APP_LOG_GRAB_FILE`: log file name (Grab library output only)
- `APP_LOG_HTML_ERR`: write the page HTML to the log when any exception occurs
- `APP_CACHE_ENABLED`: enable page caching to your db (any value enables it)
- `APP_CACHE_DB_HOST`: db host
- `APP_CACHE_DB_PORT`: db port (default = 3306)
- `APP_CACHE_DB_TYPE`: db type (mysql, mongo and some others are supported; see the Grab docs)
- `APP_CACHE_DB_USER`: db user
- `APP_CACHE_DB_PASS`: db password
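
Because every value is a string, the code that consumes these settings has to convert them itself. The following is a minimal sketch of how that conversion might look, not the project's actual implementation. The helper names (`resolve_log_level`, `resolve_db_port`, `is_cache_enabled`) and the plain `dict` of settings are assumptions made for illustration, and "any value" for `APP_CACHE_ENABLED` is interpreted here as "any non-empty string".

```python
import logging


def resolve_log_level(work_mode: str) -> int:
    """Map an APP_WORK_MODE string to a logging level (hypothetical helper).

    'dev' -> DEBUG, 'info' -> INFO, anything else -> ERROR.
    """
    levels = {"dev": logging.DEBUG, "info": logging.INFO}
    return levels.get(work_mode, logging.ERROR)


def resolve_db_port(raw_port: str = "") -> int:
    """Convert APP_CACHE_DB_PORT from a string, falling back to 3306."""
    return int(raw_port) if raw_port else 3306


def is_cache_enabled(settings: dict) -> bool:
    """Assumption: any non-empty APP_CACHE_ENABLED value turns caching on."""
    return bool(settings.get("APP_CACHE_ENABLED"))


if __name__ == "__main__":
    # Example settings; the values are illustrative only.
    settings = {"APP_WORK_MODE": "dev", "APP_CACHE_ENABLED": "1"}
    logging.basicConfig(level=resolve_log_level(settings["APP_WORK_MODE"]))
    logging.debug("cache enabled: %s", is_cache_enabled(settings))
```

Keeping the conversions in small functions makes the defaults (error level, port 3306) easy to check in one place.
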
- `APP_PARSER`: name of the file that stores the parser logic (a class extending Spider)
- `APP_THREAD_COUNT`: number of threads for `grab.spider`
- `APP_TRY_LIMIT`: how many times the app may retry a failed task
- `APP_SAVER_CLASS`: save in CSV or JSON format (or write your own saver) [may crash when CSV is used with nested dicts]
- `APP_OUTPUT_CAT`: save file mode: '' (empty) for a single file (same behaviour when this property is not defined), 'test' to split the result data into separate files by the 'test' result field
- `APP_OUTPUT_DIR`: output directory
- `APP_OUTPUT_ENC`: output encoding [default 'utf-8']
- `APP_SAVE_FIELDS_{NUMBER}`: names of the fields to save to the file (other fields are dropped, even if parsed)
- `APP_COOKIE_NAME` and `APP_COOKIE_VALUE` (both optional): set this cookie before all requests
- `SITE_URL_{NUMBER}`: site URLs to parse
- `INPUT_URLS_FILENAME`: `*.txt` file with a list of URLs (newline-separated) to load into `self.links_todo` in a parser class (for parsing a plain list of links instead of dynamic XPath rules)