html2md

Category: Blog
Development tool: Python
File size: 0KB
Downloads: 0
Upload date: 2023-10-12 09:46:45
Uploader: sh-1993
Description: A Python script that reads the Debian wiki news page and outputs a Markdown file that renders the same page.

File list:
main.py (6350, 2023-10-23)
poetry.lock (22914, 2023-10-23)
pyproject.toml (514, 2023-10-23)
requirements.txt (254, 2023-10-23)
test_main.py (3620, 2023-10-23)
thought-process.md (5004, 2023-10-23)

# html2md

[![Tests](https://github.com/mungai-njoroge/html2md/actions/workflows/run_tests.yml/badge.svg)](https://github.com/mungai-njoroge/html2md/actions/workflows/run_tests.yml)

A Python script that reads the [Debian news page](https://wiki.debian.org/News) and spits out a Markdown file that renders the same page.

> [Read the thought process](https://github.com/mungai-njoroge/html2md/blob/main/thought-process.md)

## Running it

Clone this repo locally, create a virtual environment, install dependencies and run `main.py`.

```sh
git clone https://github.com/mungai-njoroge/html2md.git
cd html2md
```

If you have [Poetry](https://python-poetry.org) installed:

```sh
poetry install

# run main.py
poetry run python main.py
```

Without Poetry:

```sh
# create virtual environment
python -m venv venv

# activate it
source venv/bin/activate

# install dependencies
pip install -r requirements.txt

# run script
python main.py
```

## Libraries used

- [requests](https://pypi.org/project/requests/) - Downloading the webpage
- [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) - Parsing HTML into a tree structure
- [markdownify](https://github.com/matthewwithanm/python-markdownify) - Generating Markdown from a string

## How it works

The page is fetched using the [requests](https://pypi.org/project/requests/) package and then parsed into a tree structure using [BeautifulSoup](https://pypi.org/project/beautifulsoup4/).

Important information that can be used by a wiki engine is extracted from the page and stored to be used as front matter in the final Markdown file.

The relevant section of the webpage is inside the element with id `content`. This section is singled out using the [`BeautifulSoup.find`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find) method.

Unneeded elements inside `content` are identified and removed using the [`BeautifulSoup.decompose`](https://www.crummy.com/software/BeautifulSoup/bs4/doc/#decompose) method.

The [`markdownify`](https://github.com/matthewwithanm/python-markdownify) package is then used to generate Markdown from the remainder.

## Running tests

Tests are defined in the `test_main.py` file. Run them with `pytest`, which is installed as a dependency.

```sh
python -m pytest
```

With Poetry:

```sh
poetry run python -m pytest
```
