grawler
Category: Data collection / Web crawlers
Development tool: Go
File size: 7KB
Downloads: 0
Upload date: 2016-05-09 19:09:59
Uploader: sh-1993
Description: A web crawler scraper engine written in Golang
File list (size in bytes, date):
LICENSE (1081, 2016-05-10)
configs.go (1436, 2016-05-10)
extractor (0, 2016-05-10)
extractor\css.go (477, 2016-05-10)
extractor\xpath.go (824, 2016-05-10)
item.go (224, 2016-05-10)
processors (0, 2016-05-10)
processors\text.go (1302, 2016-05-10)
spider.go (5498, 2016-05-10)
# goscrape
A Go scraper inspired by Python's Scrapy (http://scrapy.org).

This repository is still a work in progress.
Roadmap:
1. Better logging
2. Exporting scraped content to JSON, XML or other formats
3. Web Interface for viewing jobs / creating dynamic scrapers
4. Sitemap spider
5. **Tests**
# Example Code
```go
package main

import (
	"fmt"
	"log"
	"strings"

	"github.com/rakanalh/goscrape"
	"github.com/rakanalh/goscrape/extract"
	"github.com/rakanalh/goscrape/processors"
)

func parseHandler(response *goscrape.Response) goscrape.ParseResult {
	log.Println(response.URL)
	if !strings.Contains(response.URL, "category") {
		return goscrape.ParseResult{}
	}

	xpath, err := extract.NewXPath(response.Content)
	if err != nil {
		log.Println("Could not parse document:", err)
		return goscrape.ParseResult{}
	}

	nodes, err := xpath.Extract("//ul[@class=\"list-thumb-right\"]/li/a[position()=1]/@href")
	if err != nil {
		log.Println("Could not evaluate XPath:", err)
		return goscrape.ParseResult{}
	}

	urls, err := processors.GetLinks(nodes)
	if err != nil {
		log.Println("Could not extract links:", err)
		return goscrape.ParseResult{}
	}

	// The same links could be collected with the CSS extractor instead:
	// css, err := extract.NewCss(response.Content)
	// selection, err := css.Extract("ul.list-thumb-right > li > a")
	// urls, err := processors.GetLinks(selection)

	log.Println("Processing URL: " + response.URL)
	for i, url := range urls {
		urls[i] = "http://www.testdomain.net" + url
	}

	return goscrape.ParseResult{
		Urls: urls,
	}
}

func itemHandler(item goscrape.ParseItem) {
	fmt.Println(item)
}

func main() {
	startUrls := []string{
		"http://www.testdomain.net/category/test_uri",
	}

	configs := goscrape.Configs{
		StartUrls:    startUrls,
		WorkersCount: 3,
	}

	spider := goscrape.NewSpider(&configs, parseHandler, itemHandler)
	spider.Start()
}
```
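The parse handler above builds absolute URLs by prefixing the domain with string concatenation, which breaks for hrefs that are already absolute or that are relative to the current path. Go's standard `net/url` package resolves references correctly. A minimal, goscrape-independent sketch (the base URL and href values are placeholders):

```go
package main

import (
	"fmt"
	"net/url"
)

// absolutize resolves a possibly relative href against a base URL,
// following RFC 3986 reference resolution instead of naive concatenation.
func absolutize(base, href string) (string, error) {
	b, err := url.Parse(base)
	if err != nil {
		return "", err
	}
	h, err := url.Parse(href)
	if err != nil {
		return "", err
	}
	return b.ResolveReference(h).String(), nil
}

func main() {
	abs, err := absolutize("http://www.testdomain.net/category/test_uri", "/item/42")
	if err != nil {
		panic(err)
	}
	fmt.Println(abs) // http://www.testdomain.net/item/42
}
```

This also leaves hrefs that are already absolute (e.g. `http://other.example/...`) untouched, which the concatenation in the example would corrupt.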