HN_SO_analysis

python exploratory-data-analysis stackoverflow EDA hackernews

Category: Big Data
Development tool: Python
File size: 17142KB
Downloads: 0
Upload date: 2018-07-08 14:11:10
Uploader: sh-1993

Description: HN_SO_analysis. Is there a relationship between popularity of a given technology on Stack Overflow (SO) and Hacker News (HN)? And a few words about causality

File list:

LICENSE (1064, 2018-07-08)
codes (0, 2018-07-08)
codes\00_main.py (42438, 2018-07-08)
codes\__pycache__ (0, 2018-07-08)
codes\__pycache__\calc_granger_causality.cpython-36.pyc (2700, 2018-07-08)
codes\__pycache__\data_min_date.cpython-36.pyc (838, 2018-07-08)
codes\__pycache__\diff_nonstationary.cpython-36.pyc (878, 2018-07-08)
codes\__pycache__\grangercausalitytests_mod.cpython-36.pyc (2313, 2018-07-08)
codes\__pycache__\hn_plots.cpython-36.pyc (4135, 2018-07-08)
codes\__pycache__\sel_data_min_date.cpython-36.pyc (823, 2018-07-08)
codes\__pycache__\useful.cpython-36.pyc (632, 2018-07-08)
codes\calc_granger_causality.py (6212, 2018-07-08)
codes\diff_nonstationary.py (913, 2018-07-08)
codes\grangercausalitytests_mod.py (4038, 2018-07-08)
codes\hn_plots.py (8051, 2018-07-08)
codes\kaggle_data.py (1955, 2018-07-08)
codes\old (0, 2018-07-08)
codes\old\kaggle_d3js_data_20180414_1511.py (1662, 2018-07-08)
codes\old\kaggle_data_20180319.py (1861, 2018-07-08)
codes\sel_data_min_date.py (1245, 2018-07-08)
codes\stack_queries.sql (2078, 2018-07-08)
codes\useful.py (314, 2018-07-08)
hacker_news_analysis (0, 2018-07-08)
kaggle_data (0, 2018-07-08)
kaggle_data\kaggle_data_20180414_1358.csv (6057830, 2018-07-08)
kaggle_data\old (0, 2018-07-08)
kaggle_data\old\kaggle_data_20180319.csv (5827438, 2018-07-08)
kaggle_data\old\kaggle_data_20180403.csv (5775899, 2018-07-08)
kaggle_data\old\kaggle_data_part1.csv (4358891, 2018-07-08)
kaggle_data\old\tech_per_day.csv (7608053, 2018-07-08)
plots (0, 2018-07-08)
... ...

Is there a relationship between popularity of a given technology on Stack Overflow (SO) and Hacker News (HN)? And a few words about causality
================

dgwozdz
3rd June 2018

Last update: 8th July 2018

## Table of Contents

1) [Introduction](https://github.com/dgwozdz/HN_SO_analysis/blob/master/#intro)
2) [How to cope with the problem](https://github.com/dgwozdz/HN_SO_analysis/blob/master/#suggested-solutions)
3) [Exploratory Data Analysis](https://github.com/dgwozdz/HN_SO_analysis/blob/master/#eda)
4) [Granger causality](https://github.com/dgwozdz/HN_SO_analysis/blob/master/#granger)
5) [Summary](https://github.com/dgwozdz/HN_SO_analysis/blob/master/#summary)
6) [Further research](https://github.com/dgwozdz/HN_SO_analysis/blob/master/#further)
7) [Acknowledgments](https://github.com/dgwozdz/HN_SO_analysis/blob/master/#acknowledgments)

-----

## 1\. Introduction

[Stack Overflow](https://stackoverflow.com) and [Hacker News](https://news.ycombinator.com/) are portals mainly (but not only) read and used by programmers and other people who occupy their (professional or free) time with writing code.

### Stack Overflow (hereafter referred to as SO)

![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/logos/stack.png)

Stack Overflow (SO), a portal established in 2008 on which programmers help each other by asking and answering coding questions, lets its users easily find questions related to a certain programming language/framework/library etc. by tags. The questions and replies/comments are rated with points, so it is usually instantly obvious which answer was rated the highest (and is therefore considered the best one by the community) or whether a described problem is reproducible, i.e. whether you can replicate it with a piece of code prepared by the person asking the question.
### Hacker News (hereafter referred to as HN)

![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/logos/hn.png)

Hacker News (HN) is a portal established in 2007 on which users submit interesting links/stories. Those stories gather points just like questions on SO (however, users cannot downvote stories until they reach a certain karma threshold). Each post can be commented on.

## Causality and relationship

Does one site influence the other? They could, but the factors behind it (the *causes*) would rather be external. Let’s see a definition of causality:

***Causality** (also referred to as causation, or cause and effect) is what connects one process (the cause) with another process or state (the effect), where the first is partly responsible for the second, and the second is partly dependent on the first. In general, a process has many causes, which are said to be causal factors for it, and all lie in its past.*^{[1](https://github.com/dgwozdz/HN_SO_analysis/blob/master/#f1)}

Causality is a phenomenon which people intuitively understand but which is tough to measure with statistical methods. Let’s say you would like to build a model which explains the behaviour of defaults on mortgages. You may use the following data to explain a lack of repayments: clients’ incomes, GDP, clients’ heights, their genders etc. You employ a regression/decision tree/neural network and it seems that the best predictors are sex and height. Does that mean that those variables influence defaults? It may just be a spurious correlation. Those variables may carry other information, not directly implied by themselves (e.g. that there is a gap in salaries between genders). You may not have considered other important variables (e.g. number of kids, length of employment, the history of previous loans if it’s available), so the variables you did put in your model turn out to be statistically significant.
Or, because such a situation is also possible, the significant variables do indeed influence defaults but you just don’t understand the dependence between those phenomena. Causality seems to be a tough thing to identify.

In the context of SO and HN, the factors influencing both sites could be, for example, the following:

1. Both sites (to some extent) share a user base: programmers or people writing code (for whatever reason). Some of these users check Hacker News to be informed about new technologies but they visit SO if they have a specific problem.
2. The initial popularity of a given topic/programming language on both sites can be driven by the company which produced it, as in the case of Swift (which was developed by Apple).
3. Some users may be interested specifically in their language(s) of choice and read posts about them on HN.
4. Popularity of a given technology on both sites can be driven by other factors:
    1. meetups/conferences,
    2. the availability of error-free documentation,
    3. size of the current user base,
    4. materials to learn from: books, courses, tutorials, blog posts.

## 2\. How to tackle the problem

### Popularity

This article examines data from SO and HN, trying to answer the question whether an intuitive relationship between the two is reflected in the available data. It is a combination of Exploratory Data Analysis (EDA) with some descriptive statistics. It does not try to further investigate or quantify the above-mentioned factors influencing both Stack Overflow and Hacker News.

When dealing with the problem of a relationship, an operationalization of the variables to investigate is needed. Here, the phenomenon which was examined was described as “popularity of a given technology”. What is it and how do we measure it in the context of SO and HN? **Popularity** could be defined as liking of or attraction to a certain person, an idea or, in our case, a technology.
In the context of this analysis (SO and HN), you can name at least a few metrics by which you could measure whether something is popular or not. Additionally, they should be measured in a certain time unit, e.g. daily. If the data that we would like to analyse can be identified by an ordered time index (the unit is irrelevant), that means that we are dealing with a **time series**.

I suggest starting the analysis with available empirical variables such as the number of questions (SO) / posts (HN) for a given programming language and the points gathered by those questions (SO) / topics (HN). Those two are probably the most universal ones. Of course, you could come up with many more variables, for example:

1) number of times questions from a certain time span (e.g. from a given day) were tagged as a favourite,
2) number of comments for questions from a certain time span,
3) number of views of questions from a certain time span,
4) number of replies to questions from a certain time span.

However, there is a small problem with variable 3): the number of views is not reproducible. SO shows only the number of views a given question has gathered up to today’s date, so obtaining this variable at different points in time (e.g. on 1st June 2018 and on 2nd June 2018) results in different values. Due to this fact, the variable, although interesting, might lead to irreproducible results.

When it comes to obtaining data from Stack Overflow, it was easy to identify which question is assigned to which technology: every question has its tags. Data preprocessing was a bit tougher in the case of Hacker News. A topic was defined as related to a certain technology if the name of this technology appears either in its title or in its text (comments were not taken into consideration).

The selected variables (number of questions/topics/points) were usually analysed in four pairs:

1) number of questions on SO vs. number of topics on HN,
2) number of questions on SO vs. number of points on HN,
3) number of points on SO vs. number of topics on HN,
4) number of points on SO vs. number of points on HN.

I have previously written that all variables should be measured in a certain time unit, which leads to the next issue: how to aggregate data from a certain period? I decided to use a sum as the aggregation function, e.g. the sum of questions which appeared on a certain day. You could come up with, for example, an average. However, the problem with such a metric is the small samples on which it would be computed (for example, the mean number of points gathered by questions on SO from a given day when only one or two posts popped up during those 24 hours), which would make it unrepresentative.

How to cope with the problem of validating a relationship between two phenomena? The **first approach** could be an EDA - **Exploratory Data Analysis**, which basically means producing some plots and trying to infer something from them. The plus of this solution is the visual aspect: you can clearly see the trend (or the lack thereof) in the popularity of a given programming language, and for most people it is easier to read plots than bare tables. The somewhat hindering side of this method when trying to unravel causality is its qualitative character: there is no statistic/test indicating whether your conclusions drawn from plots are correct or not.

The **second approach** is a quantitative one: **Granger causality**. *Wait a minute,* you may ask, *there’s a specific type of causality?* Basically, yup. Granger causality, proposed in 1969, determines whether one time series is helpful in forecasting another time series. Therefore, this type of causality is called *predictive causality*.
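
To make the idea of predictive causality more concrete, here is a minimal sketch of the F-test that underlies a Granger causality check, assuming NumPy is available. It is not the code used in this analysis: the function name `granger_f_stat`, the synthetic series and the single lag are illustrative assumptions. The sketch compares a model that forecasts a series from its own past with a model that additionally uses the past of the other series; both series are assumed to be stationary.

```python
import numpy as np


def granger_f_stat(cause, effect, lags=1):
    """F-statistic for 'past values of `cause` help forecast `effect`'.

    Hypothetical helper for illustration; assumes stationary series.
    """
    cause = np.asarray(cause, dtype=float)
    effect = np.asarray(effect, dtype=float)
    n = len(effect)
    target = effect[lags:]  # values to be forecast

    # Column k holds the series shifted back by k steps.
    own_lags = np.column_stack(
        [effect[lags - k:n - k] for k in range(1, lags + 1)])
    cause_lags = np.column_stack(
        [cause[lags - k:n - k] for k in range(1, lags + 1)])
    const = np.ones((n - lags, 1))

    restricted = np.hstack([const, own_lags])                # own past only
    unrestricted = np.hstack([const, own_lags, cause_lags])  # plus other series

    def rss(design):
        # Residual sum of squares of an ordinary least squares fit.
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        return float(resid @ resid)

    rss_r, rss_u = rss(restricted), rss(unrestricted)
    dof = (n - lags) - unrestricted.shape[1]
    return ((rss_r - rss_u) / lags) / (rss_u / dof)


# Synthetic example: y follows x with a one-step delay, plus small noise.
rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = np.empty(300)
y[0] = 0.0
y[1:] = 0.8 * x[:-1] + rng.normal(scale=0.1, size=299)

f_xy = granger_f_stat(x, y)  # does the past of x help forecast y?
f_yx = granger_f_stat(y, x)  # does the past of y help forecast x?
print(f"x -> y: F = {f_xy:.1f}")
print(f"y -> x: F = {f_yx:.1f}")
```

A large F-statistic for `x -> y` together with a small one for `y -> x` suggests that past values of `x` help forecast `y` but not the other way round. Turning the statistic into a p-value requires the F distribution, e.g. `scipy.stats.f.sf`, or a ready-made routine such as `grangercausalitytests` from statsmodels.
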
Note that the question *Is one phenomenon a cause of another one?* is different from what Granger causality measures: here you only use past values of a given variable to forecast the future values of another phenomenon, just like when building a forecasting model. That means that **Granger causality is not, and does not indicate, causality between two phenomena**. Nonetheless, it may indicate a relationship, resulting either from a third factor influencing the two observed ones or from one variable really being an effect of the other. It is, however, impossible to tell which on the basis of the Granger causality test itself.

### Data

Data for this analysis come from two sources: (quite obviously) Stack Overflow and Kaggle (surprise\!). [Kaggle](https://www.kaggle.com/) is a website which organizes competitions for data scientists/analysts, the goal of which is to build the best predictive model for a given phenomenon based on shared data set(s). It also provides some data sets which are not strictly for competitions but can be used in EDA (Exploratory Data Analysis). The Hacker News data used in this analysis come from [the latter](http://kaggle.com/hacker-news/hacker-news). Variables regarding Stack Overflow come from queries run in [Stack Exchange Data Explorer](http://data.stackexchange.com/stackoverflow), which allows anyone interested to write SQL queries against Stack Overflow as well as other Stack Exchange databases.

The data cover the period **15th September 2008 - 31st December 2017.** The programming languages or technologies which were examined include: C, C++, C\#, Cobol, CSS, D3.js, R, Delphi, Fortran, Hadoop, HTML, Java, Javascript, JQuery, Pascal, Perl, Python, PHP, Ruby, Rust, Scala, Shell, Spark, SQL, Swift, Tensorflow, VBA.
The choice of technologies was arbitrary.

The data from the portals were assigned specific colors: grey for Stack Overflow, orange for Hacker News. These colors are used consistently in the plots below so that it is easier to identify the source of data (Stack Overflow or Hacker News).

## 3\. Exploratory Data Analysis

One of the ideas with regard to examining causality was to check cumulative plots. Cumulative plots show the aggregated value of a given measure up to a given date. For example, if we have data in such a form:

| Date       | Value |
| ---------- | :---: |
| 2018-01-01 |   1   |
| 2018-01-02 |   2   |
| 2018-01-03 |   3   |

then the cumulative value would be the sum of all the values up to a given date:

| Date       | Value | Cumulative value |
| ---------- | :---: | ---------------: |
| 2018-01-01 |   1   |                1 |
| 2018-01-02 |   2   |                3 |
| 2018-01-03 |   3   |                6 |

### 3.0 C\#

For example, the plot below shows:

1) the cumulative number of questions asked about C\# on Stack Overflow up to the date on the x axis (grey line),
2) the cumulative number of points gathered by all topics regarding C\# on Hacker News up to the date on the x axis (orange line).

It can be noticed that by the end of 2017 the cumulative number of questions on SO exceeded 50 thousand while at the same time the number of points for topics with C\# on HN reached about 30 thousand.

![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_c_sharp_so_usage_cnt_cum_hn_all_match_score_cum.png)

The plot for C\# described above does not seem particularly interesting. It shows (rather obviously) upward trends for both variables; however, their dynamics differ. What is more, it would be nice to see standardized variables, for example in such a way that they both start at 0 and end at 1.
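
The cumulative sum from the example table and a standardization to the range from 0 to 1 can be sketched as follows. This is a minimal sketch rather than the code used for the plots: the function names are hypothetical, and min-max scaling is an assumed choice of standardization, picked because it makes a non-decreasing cumulative series start at 0 and end at 1.

```python
from itertools import accumulate


def cumulative(values):
    """Running total: element i is the sum of values[0..i]."""
    return list(accumulate(values))


def standardize(values):
    """Min-max scaling to [0, 1] (assumed form of standardization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series has no range to scale by.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


daily = [1, 2, 3]        # the "Value" column from the example table
cum = cumulative(daily)  # [1, 3, 6], the "Cumulative value" column
std = standardize(cum)   # [0.0, 0.4, 1.0]
print(cum, std)
```

Note that with this scaling a series ends exactly at 1 only if its last value is also its maximum, which holds for cumulative counts but not necessarily for cumulative points, since those can decrease when downvotes outnumber upvotes.
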
Thanks to such a data transformation technique it is possible to find time series which may be similar in terms of behaviour over time but different when it comes to the increments by which they increase (or decrease). The plot for the same phenomena and technology as above but with standardized variables is presented below:

![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_c_sharp_stand_so_usage_cnt_cum_stand_hn_all_match_score_cum.png)

Now it can be noticed that the cumulative number of questions on SO and the cumulative number of points for topics on HN show a strong resemblance. A similar resemblance can be seen when comparing standardized plots of the cumulative number of questions vs. the cumulative number of topics:

![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_c_sharp_so_usage_cnt_cum_hn_all_match_cnt_cum_double.png)

Let’s see some interesting similarities between statistics on SO and HN for different technologies on standardized plots. We will only look at the technologies for which I identified some sort of similarity between data from SO and HN or for which I discovered something interesting. Additionally, plots on the left will be the ones for standardized variables while those on the right will be for variables without transformation (standardization).

### 3.1 C

Similarly to C\#, there is a visible resemblance between the cumulative number of questions on SO and the cumulative number of points for topics on HN:

![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_c_so_usage_cnt_cum_hn_all_score_sum_cum_double.png)

as well as between the cumulative number of questions on SO and the cumulative number of topics on HN:

![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_c_so_usage_cnt_cum_hn_all_cnt_cum_double.png)

### 3.2 C++

Not surprisingly, in the case of C++ the similarities observed for C and C\# are repeated.
![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_c++_hn_all_match_score_cum_s_hn_all_match_score_cum_hn_all_match_cnt_cum_s_hn_all_match_cnt_cum_d_since2008-09-16.png)

### 3.3 Cobol

Yes, Cobol. I wanted to say (write) *good, old Cobol*; however, I find such a statement a bit of an exaggeration for this, I would say, *antique* programming language. If you don’t know what Cobol is, its name is an acronym of *common business-oriented language*, and its syntax closely resembles the English language. The aim of such a design was to be readable for both programmers and non-technical staff like managers. It was introduced in 1959(\!) and was/is used in a variety of environments, including banking and insurance.

The plot below shows a high resemblance between the cumulative number of questions on SO and the cumulative number of topics regarding this programming language on HN.

![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_cobol_so_usage_cnt_cum_hn_all_cnt_cum_double.png)

It is worth noticing that in a span of about 9 years fewer than 300 questions appeared on SO, which shows the low popularity, or even a lack of popularity, of this technology nowadays.

### 3.4 CSS

A similar resemblance is observed between the cumulative number of questions on SO and the cumulative number of points gathered by the topics on HN.
![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_css_so_usage_cnt_cum_hn_all_score_cum_double.png)

and between the cumulative number of questions on SO and the cumulative number of topics on HN:

![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_css_so_usage_cnt_cum_hn_all_cnt_cum_double.png)

### 3.5 D3.js

In the case of the Javascript visualization library D3.js, a resemblance is observed between the cumulative number of points obtained by questions on SO and the cumulative number of points gathered by topics on HN:

![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_d3js_so_usage_cnt_cum_hn_all_score_cum_double.png)

### 3.6 Delphi

For Delphi the cumulative number of questions on SO seems to follow the same trend as the cumulative number of topics on Hacker News:

![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_delphi_so_usage_cnt_cum_hn_all_cnt_cum_double.png)

### 3.7 Fortran

When it comes to Fortran, the similarity is observed between the cumulative number of questions on SO and the cumulative number of topics on HN.

![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_fortran_so_usage_cnt_cum_hn_all_cnt_cum_double.png)

Similarly to Cobol, this programming language gathered only 700 questions in over 9 years, which indicates its unpopularity.

### 3.8 Hadoop

In the case of Hadoop, the cumulative number of questions on SO seems to be similar to the cumulative number of points on HN. What’s interesting here is the change of dynamics in 2013: since the middle of that year the number of questions on SO has grown faster than the number of points on HN.

![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_hadoop_so_score_sum_cum_hn_all_score_sum_cum_double.png)

### 3.9 HTML

In the case of HTML there was no resemblance between the variables.
Nevertheless, the interesting fact is that since 2014 the number of points for questions on SO stabilizes and later slightly decreases by about 5%, which is shown on the plot below:

![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_html_so_score_sum_cum_hn_all_score_sum_cum_double.png)

Such a situation occurred due to a greater number of downvotes than upvotes. This may be a result of a high number of duplicates since 2014 or of questions which were not formulated in a clear way or were not reproducible (and therefore were downvoted).

### 3.10 Java

In the case of Java the resemblance is visible between the cumulative number of questions on SO and the cumulative number of topics on HN:

![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_java_so_cnt_sum_cum_hn_all_cnt_sum_cum_double.png)

Like in the case of HTML, here the cumulative number of points also levels off in 2014 to fall slightly by the end of 2015. The significan ... ...
