Web-Crawler
所属分类:数据采集/爬虫
开发工具:Python
文件大小:7KB
下载次数:0
上传日期:2014-03-07 04:38:09
上 传 者:
sh-1993
说明: python中一个以原始为中心的网络爬虫
(A primitive focused web crawler in python)
文件列表:
crawler.py (11357, 2014-03-07)
Web_Crawler
===========
A Web crawler is an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing. A Web crawler may also be called a Web spider, an ant, an automatic indexer, or a Web scutter. Web search engines and some other sites use Web crawling or spidering software to update their web content or indexes of others sites' web content. Web crawlers can copy all the pages they visit for later processing by a search engine that indexes the downloaded pages so that users can search them much more quickly.
This web crawler is a focused crawler which takes in a query from the user. It then get the top ten google search results and starts crawling those urls simultaneously using multithreading. For every page that is getting crawled, word occurance count is maintained and all the links are extracted from the page. These extracted links are then recursively crawled based on page relevance (ie. if query string is present on the page). A max Priority Queue is maintained to store the pages (word count as priority) for any future use for page ranking.
All the relevant links that are extracted are saved in a file "links.txt". The accuracy of crawler is calculated along with the total quantity of data that was downloaded in Mb's.
------------------------------------------------------------------------
1) Program Structure
class Crawler:
def __init__(self, thread_id, url, query, file, lock):
def pagevisited(self, url):
def crawl(self, url):
def run(self):
def main():
def getgoogle(query):
------------------------------------------------------------------------
2) Input Data
Query : Word/words to be searched
Page Limit Number : Number of pages that are to be retireved
Debug Mode : Prints out the exception messages
------------------------------------------------------------------------
3) Execution
Step 1> main -> getgoogle
Step 2> main -> Crawler.start (x10) -> run
Step 3> run -> crawl
Step 4> crawl-> crawl
Step 1> Execution begins at the main() method which prompts the user
for query input and for number of pages to be found.The query
is then passed to getgoogle() method which returns the top 10
search results for the given query.
Step 2> The main() method then iterates over each url returned by
getgoogle() and creates a thread object Crawler for each url.
Each thread is started and the run() method is invoked for
every instance of thread.
Step 3> The run() method then parses the url passed to it and extracts
all links on the page. It then iterates over all the extracted
links and calls the crawl() method for crawling each url.
Step 4> The crawl() method parses the url passed to it. It checks if
the link has already been visited or not. It then checks the
robots.txt file of the host url and checks if the current url
can be accessed. Then the word count is calculated and the
url is pushed into the priority queue along with the word count
as page priority. ALso the crawl() method checks if the url
if an anchor jump link. Once the page information is written
onto the file links.txt, all the links on the current page
are extracted. Iterating over these links, each link is
passed to crawl() method as a recursive call.
The program execution continues till the specified number of pages have
been found or till the crawler has crawler all the relevant links. Any
exception that is thrown is ignored unless the program is running in
debug_mode.
Also, since there are multiple threads running, we used thread locks
while writing data onto the file since we don't want more than one
thread writing onto the file.
------------------------------------------------------------------------
4) Output Data
All the relevant links are saved in a file name links.txt in the same
directory. The program also calculates the approximate amount of data
that was downloaded and what percent of it was relevant.
------------------------------------------------------------------------
------------------------------------------------------------------------
Libraries Used :
urllib, lxml, heapq, json, math, sys, threading, robotparser
Required library installation :
for lxml use : pip install lxml
To Execute :
Run the crawler.py python file.
No need to create any other file separately.
Extracted links will be saved in links.txt file in same directory.
------------------------------------------------------------------------
近期下载者:
相关文件:
收藏者: