This repository was archived by the owner on Apr 12, 2024. It is now read-only.

Holovin/PythonParsersGrab


DParsers-Grab-Core (v2.91)

Common core for site parsing with the Python Grab framework.

Install Python (pre-install)

  1. Install Python 3.9
wget -O python.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash python.sh -b
rm python.sh
export PATH=~/miniconda3/bin:$PATH
  2. Install pipenv (pip install pipenv)

Project install

  1. Clone project
  2. In the project directory, run pipenv install
  3. [Optional for Windows] Download and install curl

Running

  1. Run pipenv shell
  2. Run python main.py {SITE_CONFIG_FILE_NAME}

Base config .env description

All values must be strings:

  • APP_WORK_MODE: dev sets the DEBUG level for all loggers; info is also supported; any other value sets the error level

  • APP_CAN_OUTPUT: True allows print(...) output of some important messages

  • APP_LOG_FORMAT: log format (in Python logger format)

  • APP_LOG_DIR: log directory name

  • APP_LOG_DEBUG_FILE: log file name (output of the project's own code only)

  • APP_LOG_GRAB_FILE: log file name (output of the grab library only)

  • APP_LOG_HTML_ERR: write the page HTML to the log when any exception occurs

  • APP_CACHE_ENABLED: enable page caching to your db (any value enables it)

  • APP_CACHE_DB_HOST: db host

  • APP_CACHE_DB_PORT: db port (default = 3306)

  • APP_CACHE_DB_TYPE: db type (supports mysql, mongo and some others; see the grab docs)

  • APP_CACHE_DB_USER: db user

  • APP_CACHE_DB_PASS: db password
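
The APP_WORK_MODE and APP_CACHE_ENABLED rules above can be sketched in a few lines of Python. This is only an illustration of the described behaviour, not the project's actual code: the helper names resolve_log_level and cache_enabled are hypothetical, and treating an empty string as "cache disabled" is an assumption.

```python
import logging
from typing import Mapping


def resolve_log_level(work_mode: str) -> int:
    """Map an APP_WORK_MODE string to a logging level (hypothetical helper)."""
    # "dev" -> DEBUG, "info" -> INFO, anything else falls back to ERROR,
    # as described in the list above.
    levels = {"dev": logging.DEBUG, "info": logging.INFO}
    return levels.get(work_mode, logging.ERROR)


def cache_enabled(env: Mapping[str, str]) -> bool:
    """Treat any non-empty APP_CACHE_ENABLED value as 'enabled' (assumption)."""
    return bool(env.get("APP_CACHE_ENABLED", ""))


if __name__ == "__main__":
    import os

    # Configure the root logger from the current environment.
    logging.basicConfig(level=resolve_log_level(os.environ.get("APP_WORK_MODE", "")))
    logging.getLogger(__name__).info("cache enabled: %s", cache_enabled(os.environ))
```

Because all values in the .env file are strings, the sketch compares strings directly and never parses booleans or numbers.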

Base config {site}.env description

  • APP_PARSER: name of the file that stores the parser logic (a class extending Spider)
  • APP_THREAD_COUNT: number of threads for grab.spider
  • APP_TRY_LIMIT: how many times the app may retry a failed task
  • APP_SAVER_CLASS: save in CSV or JSON format (or write your own saver) [CSV may crash on nested dicts]
  • APP_OUTPUT_CAT: save file mode: '' (empty) for a single file (same behaviour when this property is not defined), 'test' to split the result data into separate files by the 'test' result field
  • APP_OUTPUT_DIR: output dir
  • APP_OUTPUT_ENC: output encoding [default 'utf-8']
  • APP_SAVE_FIELDS_{NUMBER}: names of the fields to save in a file (other fields are dropped, even if parsed)
  • APP_COOKIE_NAME and APP_COOKIE_VALUE (both optional): set this cookie before all requests
  • SITE_URL_{NUMBER}: site URLs to parse
  • INPUT_URLS_FILENAME: *.txt file with a list of URLs (one per line) to load into self.links_todo in a parser class (for parsing a simple list of links instead of dynamic xpath rules)
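
As a rough illustration of the numbered keys (SITE_URL_{NUMBER}, APP_SAVE_FIELDS_{NUMBER}) and the INPUT_URLS_FILENAME list described above, the sketch below gathers such values from an environment-like mapping and reads a newline-separated URL file. The function names collect_numbered and read_url_list are hypothetical, and ordering values by their numeric suffix is an assumption; the core may read these settings differently.

```python
import os
import re
from typing import List, Mapping


def collect_numbered(env: Mapping[str, str], prefix: str) -> List[str]:
    """Return values of keys such as PREFIX1, PREFIX2, ... ordered by number."""
    pattern = re.compile(re.escape(prefix) + r"(\d+)$")
    found = []
    for key, value in env.items():
        match = pattern.match(key)
        if match:
            # Keep the numeric suffix so SITE_URL_10 sorts after SITE_URL_2.
            found.append((int(match.group(1)), value))
    return [value for _, value in sorted(found)]


def read_url_list(path: str, encoding: str = "utf-8") -> List[str]:
    """Read a newline-separated URL file, skipping blank lines."""
    with open(path, encoding=encoding) as handle:
        return [line.strip() for line in handle if line.strip()]


if __name__ == "__main__":
    # Uses the real process environment; prints empty lists if nothing is set.
    print(collect_numbered(os.environ, "SITE_URL_"))
    print(collect_numbered(os.environ, "APP_SAVE_FIELDS_"))
```

A parser class could then assign the result of read_url_list(...) to self.links_todo, which is the attribute named in the INPUT_URLS_FILENAME entry.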

About

Wrapper for grab (python framework) with some additional features
