AAI3003_Project
所属分类:自然语言处理
开发工具:Jupyter Notebook
文件大小:0KB
下载次数:0
上传日期:2024-04-01 13:13:20
上 传 者:
sh-1993
说明: 使用新闻文章的NLP项目
(NLP Project using news articles)
文件列表:
dataset/
docs/
metrics/
models/
.pylintrc
LICENSE
Pipfile
Pipfile.lock
bart.ipynb
bart.py
bert_training.py
distilbert_predict.py
feature_extract.py
format.ipynb
get_urls.py
requirements.txt
scraper.py
summarise.py
tf_idf.ipynb
tf_idf.py
# News Article Summariser
Web Scraping Tool for Extracting News Headlines & Summariser
### Description
This Python script is designed for web scraping news headlines from the Today Online website. It utilizes the BeautifulSoup and Selenium libraries to extract data from the HTML content of the specified URL.
### Prerequisites
Make sure you have the following installed before running the script:
Python (3.12 recommended)
Required Python packages as listed in the requirements.txt
Download the required geckodriver for your OS from https://github.com/mozilla/geckodriver/releases
You can install the required packages using the following command:
```
pip install -r requirements.txt
```
### Usage
```
python main.py
```
### Configuration
The script is currently set to run in headless mode using Firefox. If needed, you can customize the web driver options or use a different browser by modifying the script.
```
firefox_options = webdriver.FirefoxOptions()
firefox_options.add_argument("--headless")
browser = webdriver.Firefox(service=FirefoxService(GeckoDriverManager().install()), options=firefox_options)
```
Add this line of code to virtual environment activate file to not get the API limit exceed error.
Replace it with your own Github Personal Access Token
```
export GH_TOKEN = "asdasdasdasd"
```
### License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
Beautiful Soup
Selenium
Webdriver Manager
近期下载者:
相关文件:
收藏者: