hacker_news_scrape

Data scraper and API for Hacker News posts

File list:
alembic/
fixtures/
hacker_news/
images/
sample_data/
tests/
utils/
.coveragerc
.travis.yml
LICENSE
Procfile
alembic.ini
management.py
requirements.txt
runtime.txt
server.py
server.wsgi

[![Build Status](https://travis-ci.com/estherh5/hacker_news_scrape.svg?branch=master)](https://travis-ci.com/estherh5/hacker_news_scrape) [![codecov](https://codecov.io/gh/estherh5/hacker_news_scrape/branch/master/graph/badge.svg)](https://codecov.io/gh/estherh5/hacker_news_scrape)

# Hacker News Scrape

Hacker News Scrape is a data scraper that uses the [Requests](http://docs.python-requests.org/en/master/), [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/), and [asyncio](https://docs.python.org/3/library/asyncio.html) libraries to asynchronously fetch and parse the first three pages of posts from the main feed of Y Combinator's news site, [Hacker News](http://news.ycombinator.com/). The data is stored in an Amazon RDS instance of PostgreSQL and served client-side through a series of API endpoints that return statistics for a given time period (e.g., `/api/hacker_news/stats/hour/average_comment_count` returns the average comment count for posts in the past hour; `/api/hacker_news/stats/week/top_website` returns the most common websites that articles were posted from). A minimal sketch of this fetch-and-parse pattern appears after the Setup section below.

A front-end visualization of the data can be found at [Hacker News Stats](https://hn-stats.crystalprism.io/), which displays various [Highcharts](https://www.highcharts.com/) visualizations of the scraped data, including a pie chart that shows a breakdown of the different types of posts, a word cloud of the most common words used in post comments (excluding stop words), and a bubble chart of the top five users who posted the most comments (with each bubble's width reflecting their total words used). Buttons at the top of the Stats page allow the user to toggle between different time periods of data (e.g., past hour, past day, past week) to fetch data from the API.

## Setup

1. Clone this repository on your server.
2. Install requirements by running `pip install -r requirements.txt`.
3. Create a PostgreSQL database to store Hacker News feed, post, and comment data, as well as a user that has all privileges on your database.
4. Set the following environment variables for the API:
    * `FLASK_APP` for the Flask application name for your server ("server.py")
    * `ENV_TYPE` for the environment status (set this to "Dev" for testing or "Prod" for live)
    * `VIRTUAL_ENV_NAME` for the name of your virtual environment (e.g., 'hn'); this is used to schedule automatic data scrapes of the Hacker News main feed with crontab
    * `DB_CONNECTION` for the [database URL](http://docs.sqlalchemy.org/en/latest/core/engines.html#database-urls) used to connect to your database via the SQLAlchemy ORM (i.e., ''); note that if this variable is set to '', sample data will be returned for the endpoints called from the front-end web app only
    * `DB_NAME` for the name of your database
5. Load the initial database structure by running `alembic upgrade head`.
    * Note that you might need to add `PYTHONPATH=.` to the beginning of your revision command if Alembic can't find your module.
6. Initialize the database by running `python management.py init_db` to create a custom text dictionary for use in statistic functions, and schedule hourly scrapes of Hacker News (every hour on the half hour) by running `python management.py sched_scrape`.
7. Start the server by running `flask run` (if you are making changes while the server is running, run `flask run --reload` instead for instant updates).
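For orientation, here is a minimal sketch of the asynchronous fetch-and-parse pattern described above. It is illustrative rather than the project's actual code (see the `hacker_news/` package for that): the function names are invented, `asyncio.to_thread` requires Python 3.9+, and the CSS selectors reflect Hacker News's current markup, which may differ from what the scraper targets.

```python
# Illustrative sketch only -- not the project's actual scrape code.
import asyncio

import requests
from bs4 import BeautifulSoup


async def fetch_page(page: int) -> str:
    """Fetch one page of the Hacker News main feed."""
    url = f'https://news.ycombinator.com/news?p={page}'
    # requests is blocking, so run each call in a worker thread
    response = await asyncio.to_thread(requests.get, url, timeout=10)
    response.raise_for_status()
    return response.text


async def scrape_feed() -> list:
    """Fetch the first three feed pages concurrently and parse the posts."""
    pages = await asyncio.gather(*(fetch_page(p) for p in (1, 2, 3)))
    posts = []
    for html in pages:
        soup = BeautifulSoup(html, 'html.parser')
        for row in soup.select('tr.athing'):  # one table row per post
            title_link = row.select_one('span.titleline > a')
            if title_link is not None:
                posts.append({
                    'id': row.get('id'),
                    'title': title_link.get_text(),
                    'link': title_link.get('href'),
                })
    return posts


if __name__ == '__main__':
    for post in asyncio.run(scrape_feed())[:5]:
        print(post)
```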
## Content API

To retrieve data for a specific Hacker News post or comment, a client can send a request to the following endpoints. Post and comment data are saved in the database's "post" and "comment" tables, respectively.
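The example response bodies below suggest roughly the following shape for those two tables. This is a hypothetical sketch inferred from the sample fields, not the project's actual models; the real schema (pictured in the Statistics API section) also includes feed and pass-through tables, and the column types are guesses.

```python
# Hypothetical sketch of the "post" and "comment" tables, inferred from
# the example API responses; not the project's actual models.
from sqlalchemy import Column, DateTime, ForeignKey, Integer, String, Text
from sqlalchemy.orm import declarative_base  # SQLAlchemy 1.4+

Base = declarative_base()


class Post(Base):
    __tablename__ = 'post'

    id = Column(Integer, primary_key=True)  # Hacker News post id
    created = Column(DateTime)
    feed_rank = Column(Integer)             # position in the scraped feed
    point_count = Column(Integer)
    comment_count = Column(Integer)
    title = Column(String)
    link = Column(String)
    website = Column(String)                # domain the article links to
    type = Column(String)                   # e.g. 'article'
    username = Column(String)               # submitter


class Comment(Base):
    __tablename__ = 'comment'

    id = Column(Integer, primary_key=True)    # Hacker News comment id
    created = Column(DateTime)
    post_id = Column(Integer, ForeignKey('post.id'))
    parent_comment = Column(Integer)          # parent comment id, null at top level
    level = Column(Integer)                   # depth within the comment tree
    feed_rank = Column(Integer)               # feed rank of the parent post
    content = Column(Text)
    username = Column(String)
```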

**GET** /api/hacker_news/post/[post_id]

* Retrieve the latest version of a post from Hacker News scrapes.
* Example response body:

```javascript
{
  "comment_count": 0,
  "created": "Sat, 21 Apr 2018 20:34:06 GMT",
  "feed_rank": 90,
  "link": "https://medium.com/@RoyaPak/designing-for-ethics-706efa1a483e",
  "point_count": 11,
  "title": "Designing for Ethics: Interview with Sam Woolley of the Digital Intelligence Lab",
  "type": "article",
  "username": "benbreen",
  "website": "medium.com"
}
```

**GET** /api/hacker_news/comment/[comment_id]

* Retrieve the latest version of a comment for a post from Hacker News scrapes.
* Example response body:

```javascript
{
  "content": "Developing low-cost wireless soil moisture sensors for agriculture. It's old technology but the proven benefit is huge, and adoption has been really poor (mostly) because of the cost.",
  "created": "Mon, 23 Apr 2018 20:34:41 GMT",
  "feed_rank": 50,
  "level": 0,
  "parent_comment": null,
  "post_id": 16905146,
  "username": "simonrobb"
}
```

## Statistics API

To retrieve statistics for the data scraped from Hacker News, a client can send a request to the following endpoints. Statistics are calculated for the time period specified in the request URL and use pass-through tables that connect each post/comment version to the feeds acquired during that time period:

![Hacker News Database Tables](images/hn-table.png)

**GET** /api/hacker_news/stats/[time_period]/average_comment_count

* Retrieve the average comment count from specified Hacker News scrapes ('hour', 'day', 'week', 'all').
* Example response body:

```javascript
65
```

**GET** /api/hacker_news/stats/[time_period]/average_comment_tree_depth

* Retrieve the average comment tree depth from specified Hacker News scrapes ('hour', 'day', 'week', 'all').
* Example response body:

```javascript
2
```

**GET** /api/hacker_news/stats/[time_period]/average_comment_word_count

* Retrieve the average comment word count from specified Hacker News scrapes ('hour', 'day', 'week', 'all').
* Example response body:

```javascript
59
```

**GET** /api/hacker_news/stats/[time_period]/average_point_count

* Retrieve the average point count from specified Hacker News scrapes ('hour', 'day', 'week', 'all').
* Example response body:

```javascript
154
```

**GET** /api/hacker_news/stats/[time_period]/comments_highest_word_count?count=[count]

* Retrieve the comments with the highest word counts from specified Hacker News scrapes ('hour', 'day', 'week', 'all'). Optionally specify the number of comments via the request URL's count query parameter.
* Example response body:

```javascript
[
  {
    "content": "(Adapted from a deeply-nested too-late-to-notice reply I made on another thread[1], but which I suspect more people would benefit from seeing)I am skeptical that AIs capable of piloting fully driverless cars are coming in the next few years. In the longer term, I'm more optimistic. There are definitely some fundamental breakthroughs which are needed (with regards to causal reasoning etc.) before \"full autonomy\" can happen -- but a lot of money and creativity is being thrown at these problems, and although none of us will know how hard the Hard problem is until after it's been solved, my hunch is that it will yield within this generation.But I think that framing this as an AI problem is not really correct in the first place.Currently car accidents kill about 1.3 million people per year. Given current driving standards, a lot of these fatalities are \"inevitable\". For example: many real-world car-based trolley problems involve driving around a blind curve too fast to react to what's on the other side. You suddenly encounter an array of obstacles: which one do you choose to hit? Or do you (in some cases) minimise global harm by driving yourself off the road? Faced with these kind of choices, people say \"oh, that's easy -- you can instruct autonomous cars to not drive around blind curves faster than they can react\". But in that case, the autonomous car just goes from being the thing that does the hitting to the thing that gets hit (by a human). Either way, people gonna die -- not due to a specific fault in how individual vehicles are controlled, but due to collective flaws in the entire premise of automotive infrastructure.So the problem is that no matter how good the AIs get, as long as they have to interact with humans in any way, they're still going to kill a fair number of people. I sympathise quite a lot with Musk's utilitarian point of view: if AIs are merely better humans, then it shouldn't matter that they still kill a lot of people; the fact that they kill meaningfully fewer people ought to be good enough to prefer them. If this is the basis for fostering a \"climate of acceptance\" then I don't think it would be a bad thing at all.But I don't expect social or legal systems to adopt a pragmatic utilitarian ethos anytime soon!One barrier it that even apart from the sensational aspect of autonomous-vehicle accidents, it's possible to do so much critiquing of them. When a human driver encounters a real-world trolley problem, they generally freeze up, overcorrect, or do something else that doesn't involve much careful calculation. So shit happens, some poor SOB is liable for it, and there's no black-box to audit.In contrast, when an autonomous vehicle kills someone, there will be a cool, calculated, auditable trail of decision-making which led to that outcome. The impulse to second-guess the AV's reasoning -- by regulators, lawyers, politicians, and competitors -- will be irresistible. To the extent that this fosters actual safety improvements, it's certainly a good thing. But it can be really hard to make even honest critiques of these things, because any suggested change needs to be tested against a near-infinite number of scenarios -- and in any case, not all of the critiques will be honest. This will be a huge barrier to adoption.Another barrier is that people's attitudes towards AVs can change how safe they are. Tesla has real data showing that Autopilot makes driving significantly safer. This data isn't wrong. The problem is that this was from a time when Autopilot was being used by people who were relatively uncomfortable with it. This meant that it was being used correctly -- as a second pair of eyes, augmenting those of the driver. That's fine: it's analogous to an aircraft Autopilot when used like that. But the more comfortable people become with Autopilot -- to the point where they start taking naps or climbing into the back seat -- the less safe it becomes. This is the bane of Level 2 and 3 automation: a feedback loop where increasing AV safety/reliability leads to decreasing human attentiveness, leading (perhaps) to a paradoxical overall decrease in safety and reliability.Even Level 4 and 5 automation isn't immune from this kind of feedback loop. It's just externalised: drivers in Mountain View learned that they could drive more aggressively around the Google AVs, which would always give way to avoid a collision. So safer AI driving has led to more dangerous human driving.So my contention is that while the the AVs may become \"good enough\" anytime between, say, now and 20 years from now -- the above sort of problems will be persistent barriers to adoption. These problems can be boiled down to a single word: humans. As long as AVs share a (high-speed) domain with humans, there will be a lot of fatalities, and the AVs will take the blame for this (since humans aren't black-boxed).Nonetheless, I think we will see AVs become very prominent. Here's how:1. Initially, small networks of low-speed (~12mph) Level-4 AVs operating in mixed environments, generally restricted to campus environments, pedestrianised town centres, etc. At that speed, it's possible to operate safely around humans even with reasonably stupid AIs. Think Easymile, 2getthere, and others.2. These networks will become joined-up by fully-segregated higher-speed AV-only right-of-ways, either on existing motorways or in new types of infrastructure (think the Boring Company).3. As these AVs take a greater mode-share, cities will incrementally convert roads into either mixed low-speed or exclusive high-speed. Development patterns will adapt accordingly. It will be a slow process, but after (say) 40-50 years, the cities will be more or less fully autonomous (with most of the streets being low-speed and heavily shared with pedestrians and bicyclists).Note that this scenario is largely insensitive to AI advances, because the real problem that needs to be solved is at the point of human interface.1: https://news.ycombinator.com/item?id=17170739",
    "created": "Tue, 29 May 2018 17:29:00 GMT",
    "id": 17181209,
    "level": 1,
    "parent_comment": 17179582,
    "post_id": 17179378,
    "total_word_count": 989,
    "username": "nkoren"
  }
]
```

**GET** /api/hacker_news/stats/[time_period]/comment_words?count=[count]

* Retrieve the highest-frequency words used in comments from specified Hacker News scrapes ('hour', 'day', 'week', 'all'). Optionally specify the number of words via the request URL's count query parameter. (`ndoc` and `nentry` in the response match PostgreSQL's `ts_stat` output: the number of comments a word occurs in and its total number of occurrences, respectively.)
* Example response body:

```javascript
[
  {
    "ndoc": 38796,
    "nentry": 49174,
    "word": "like"
  },
  {
    "ndoc": 32368,
    "nentry": 46670,
    "word": "people"
  },
  {
    "ndoc": 32911,
    "nentry": 43611,
    "word": "would"
  },
  {
    "ndoc": 29189,
    "nentry": 36294,
    "word": "one"
  },
  {
    "ndoc": 24834,
    "nentry": 29899,
    "word": "think"
  }
]
```

**GET** /api/hacker_news/stats/[time_period]/deepest_comment_tree

* Retrieve the deepest comment tree from specified Hacker News scrapes ('hour', 'day', 'week', 'all').
* Example response body (truncated in the source):

```javascript
{
  "comment_count": 202,
  "comment_tree": {
    "child_comment": {
      "child_comment": {
        "child_comment": {
          "child_comment": {
            "child_comment": {
              "child_comment": {
                "child_comment": {
                  "child_comment": {
                    "child_comment": {
                      "child_comment": {
                        "child_comment": {
                          "content": "Our Nexus setup is internal only. For WFH, we have hundreds of folks using a corporate VPN which routes to our office, and then our office routes to our AWS VPC, which is where our Nexus installation lives. I set this configuration up and haven't had any real issues with it, nor do I see any reason to switch between a proxy and npm.If a developer is using an older buggy version of npm that doesn't respect .npmrc and changes a lock file to point back to npmjs.org entries, we deny the PR and ask for it to be fixed. Right now that check is unfortunately manual, but there are plans to automate it. It can be easy to miss at times though, since GitHub often collapses lock files on PR's due to their size.For us, the main purpose of using Nexus as a proxy is to maintain availability and to cache/maintain package versions. If you're using Nexus to make things faster, then you probably shouldn't be using it. If you want faster installs, look into using `npm ci`.",
                          "created": "Tue, 29 May 2018 20:15:00 GMT",
                          "id": 17182568,
                          "username": "acejam"
                        },
                        "content": "Gradle build scripts can also be written in Kotlin.",
                        "created": "Tue, 29 May 2018 20:13:00 GMT",
                        "id": 17182547,
                        "username": "vorg"
                      },
                      "content": "Yeah sorry if I wasn't clear, in Gradle the build script is written in Groovy. It is used to build any number of project types.",
                      "created": "Tue, 29 May 2018 17:57:00 GMT",
                      "id": 17181454,
                      "username": "peeters"
                    },
                    "content": "Just a note: you can use Gradle for Java too. I haven't built a Java project without Gradle since like 2010 or so.",
                    "created": "Tue, 29 May 2018 17:31:00 GMT",
                    "id": 17181212,
                    "username": "_asummers"
                  },
                  "content": "I spent yesterday trying to get protobuffers working in maven, see http://vlkan.com/blog/post/2015/11/27/maven-protobuf/ for the pain.Anything counter to maven's way is a PITA. In this case, Maven dislikes platform specific binaries.",
                  "created": "Tue, 29 May 2018 16:35:00 GMT",
                  "id": 17180672,
                  "username": "tlarkworthy"
                },
                "content": "Wouldn't that be more of an issue of not having an SLA between SaaS and client?If the silly non-automated dashboard is part of the SLA, then it costs someone money/liability/trust to not maintain it, otherwise \"who cares as long the issue gets resolved, people who care about the issue are tracking the bug report?\"",
                "created": "Tue, 29 May 2018 12:47:00 GMT",
                "id": 17178596,
                "username": "nonconvergent"
              },
              "content": "I don't consider a non working status page trivial. Yes, if a SaaSs send me into a rage then there probably dozens of red flags already, yes I will work with clients to drop or replace such SaaS with one that can communicate, absolutely as it usually falls under my devop remit. This doesn't apply to npm as not a paid Iaas/SaaS, but more to point out the number of shit SaaS that don't manage their status updates probably. Imgix in the past for example, Linode when Ddos - absolutely shambolic communication, and so on",
              "created": "Tue, 29 May 2018 10:41:00 GMT",
              "id": 17177924,
              "username": "sitepodmatt"
            },
            "content": "You force your clients to do things for trivial reasons that send you info a rage?",
            "created": "Tue, 29 May 2018 08:10:00 GMT",
            "id": 17177264,
            "username": "mattmanser"
          },
          "content": "This is why I make clients drop/replace SaaSs. When SaaS don't update status page, or in a transparent and prompt way, because it's not 100% outage it rages me, especially if the reports are found elsewhere - twitter/reddit/hn.",
          "created": "Tue, 29 May 2018 06:20:00 GMT",
          "id": 17176882,
          "username": "sitepodmatt"
        },
        "content": "The entire traffic light metaphor of status pages and dashboards is questionable IMHO.",
        "created": "Tue, 29 May 2018 06:13:00 GMT",
        "id": 17176858,
        "username": "tannhaeuser"
      },
      "content": "2 hours after the incident was responded to by an npm employee and the status is still green: https://status.npmjs.org/I love fake status pages!",
      "created": "Tue, 29 May 2018 04:30:00 GMT",
      "id": 17176256,
      "username": "iends"
    },
    "created": "Tue, 29 May 2018 03:37:00 GMT",
    "fe ... ...
```
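As a usage illustration, the documented endpoints can be queried with Requests. This sketch assumes the Flask server from the Setup section is running locally on its default port (5000); the post id is taken from the Content API example above and is illustrative only.

```python
# Query a few of the documented endpoints from a local dev server.
# Assumes `flask run` is serving the API at the default address.
import requests

BASE = 'http://127.0.0.1:5000/api/hacker_news'

# Average comment count for posts scraped in the past hour
avg_comments = requests.get(f'{BASE}/stats/hour/average_comment_count').json()
print('Average comments (past hour):', avg_comments)

# Three highest-frequency comment words from the past week
words = requests.get(f'{BASE}/stats/week/comment_words',
                     params={'count': 3}).json()
for entry in words:
    print(entry['word'], entry['nentry'])

# Latest scraped version of a specific post (id is illustrative)
post = requests.get(f'{BASE}/post/16905146').json()
print(post['title'])
```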
