Description: HN_SO_analysis. Is there a relationship between the popularity of a given technology on Stack Overflow (SO) and Hacker News (HN)? And a few words about causality.
Is there a relationship between popularity of a given technology on
Stack Overflow (SO) and Hacker News (HN)? And a few words about
causality
================
dgwozdz
3rd JUNE 2018
Last update: 8th JULY 2018
## Table of Contents
1) [Introduction](https://github.com/dgwozdz/HN_SO_analysis/blob/master/#intro)
2) [How to tackle the problem](https://github.com/dgwozdz/HN_SO_analysis/blob/master/#suggested-solutions)
3) [Exploratory Data Analysis](https://github.com/dgwozdz/HN_SO_analysis/blob/master/#eda)
4) [Granger causality](https://github.com/dgwozdz/HN_SO_analysis/blob/master/#granger)
5) [Summary](https://github.com/dgwozdz/HN_SO_analysis/blob/master/#summary)
6) [Further research](https://github.com/dgwozdz/HN_SO_analysis/blob/master/#further)
7) [Acknowledgments](https://github.com/dgwozdz/HN_SO_analysis/blob/master/#acknowledgments)
-----
## 1\. Introduction
[Stack Overflow](https://stackoverflow.com) and [Hacker
News](https://news.ycombinator.com/) are portals mainly (but not only)
read and used by programmers and other people who spend their
(professional or free) time writing code.
### Stack Overflow (hereafter referred to as SO)
![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/logos/stack.png)
Stack Overflow (SO), a portal established in 2008 on which programmers
help each other by asking and answering coding questions, lets its users
easily find questions related to a certain programming
language/framework/library etc. by tags. Questions and replies/comments
are rated with points, so it is usually instantly obvious which answer
was rated the highest (and is therefore considered the best one by the
community) or whether a described problem is reproducible, i.e. whether
you can replicate it with the piece of code prepared by the person
asking the question.
### Hacker News (hereafter referred to as HN)
![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/logos/hn.png)
Hacker News (HN) is a portal established in 2007 on which users submit
interesting links/stories. Those stories gather points just like
questions on SO (however, users cannot downvote stories until they reach
a certain karma threshold). Each post can be commented on.
## Causality and relationship
Does one site influence the other? It could, but the factors behind such
influence (the *causes*) would more likely be external. Let’s look at a
definition of causality:
***Causality** (also referred to as causation, or cause and effect) is
what connects one process (the cause) with another process or state (the
effect), where the first is partly responsible for the second, and the
second is partly dependent on the first. In general, a process has many
causes, which are said to be causal factors for it, and all lie in its
past.*
[1](https://github.com/dgwozdz/HN_SO_analysis/blob/master/#f1)
Causality is a phenomenon which people understand intuitively but which
is tough to measure with statistical methods. Let’s say you would like
to build a model which explains defaults on mortgages. You may use the
following data to explain missed repayments: clients’ incomes, GDP,
clients’ heights, their genders etc. You employ a regression/decision
tree/neural network and it seems that the best predictors are sex and
height. Does that mean that those variables influence defaults? It may
just be a spurious correlation. Those variables may carry other
information which they do not directly express (e.g. that there is a gap
in salaries between genders). You may not have considered other
important variables (e.g. number of kids, length of employment, the
history of previous loans if it’s available), so the variables you did
put in your model appear statistically significant only because the
relevant ones are missing. Or, because such a situation is also
possible, the significant variables do indeed influence defaults but you
just don’t understand how those phenomena depend on each other.
Causality is a tough thing to identify.
In the context of SO and HN, the factors influencing both sites could
be, for example, the following:
1. Both sites (to some extent) share a user base: programmers or people
writing code (for whatever reason). Some of these users check Hacker
News to stay informed about new technologies but visit SO when they
have a specific problem.
2. The initial popularity of a given topic/programming language on both
sites can be driven by the company which produced it, as in the case of
Swift (which was developed by Apple).
3. Some users may be interested specifically in their language(s) of
choice and read posts about them on HN.
4. Popularity of a given technology on both sites can be driven by
other factors:
    1. meetups/conferences,
    2. the availability of error-free documentation,
    3. the size of the current user base,
    4. materials to learn from: books, courses, tutorials, blog posts.
## 2\. How to tackle the problem
### Popularity
This article examines data from SO and HN, trying to answer the question
of whether the intuitive relationship between the two is reflected in
the available data. It is a combination of Exploratory Data Analysis
(EDA) with some descriptive statistics. It does not try to further
investigate or quantify the above-mentioned factors influencing both
Stack Overflow and Hacker News.
When investigating a relationship, the variables under study first need
to be operationalized. Here, the phenomenon being examined was described
as “popularity of a given technology”. What is it and how can it be
measured in the context of SO and HN? **Popularity** could be defined as
liking of or attraction to a certain person, an idea or, in our case, a
technology. In the context of this analysis (SO and HN), you can name at
least a few metrics by which you could measure whether something is
popular or not. Additionally, they should be measured in a certain time
unit, e.g. daily. If the data that we would like to analyse can be
identified by an ordered time index (the unit is irrelevant), we are
dealing with a **time series**.
I suggest starting the analysis with available empirical variables such
as the number of questions (SO) / posts (HN) for a given programming
language and the points gathered by those questions (SO) / topics (HN).
Those two are probably the most universal ones. Of course, you could
come up with many more variables, for example:
1) number of times questions from a certain time span (e.g. from a
given day) were tagged as a favourite,
2) number of comments for questions from a certain time span,
3) number of views of questions from a certain time span,
4) number of replies to questions from a certain time span.
However, there is a small problem with variable 3): the number of views
is not reproducible. SO shows only the number of views a given question
has gathered as of today’s date, so obtaining this variable at different
points in time (e.g. on 1st June 2018 and on 2nd June 2018) results in
different values. Due to this fact, the variable, although interesting,
might lead to irreproducible results.
When it comes to obtaining data from Stack Overflow, it was easy to
identify which question is assigned to which technology, because every
question has its tags. Data preprocessing was a bit tougher in the case
of Hacker News. A topic was defined as related to a certain technology
if the name of this technology appeared either in its title or in its
text (comments were not taken into consideration).
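The matching rule for Hacker News can be sketched as follows. This is only an illustration and not the code used in the analysis: the function name `mentions_technology` is hypothetical, and the choices of case-insensitive, whole-token matching are assumptions which the description above does not confirm.

```python
import re


def mentions_technology(title, text, name):
    """Return True if `name` appears as a whole token in the title or text.

    Assumption: matching is case-insensitive and token-based. Lookarounds
    are used instead of \\b so that names ending in symbols (C++, C#) work
    and a plain "C" does not match inside "C++" or "C#".
    """
    pattern = re.compile(
        r"(?<![\w+#])" + re.escape(name) + r"(?![\w+#])",
        re.IGNORECASE,
    )
    # A missing title or text (None) is treated as an empty string.
    return any(pattern.search(field or "") for field in (title, text))


# Example titles are made up for illustration.
rust_hit = mentions_technology("Why Rust is fast", None, "Rust")
rust_miss = mentions_technology("Trust issues in teams", "", "Rust")
cpp_hit = mentions_technology("Modern C++ tips", "", "C++")
c_miss = mentions_technology("Modern C++ tips", "", "C")
```

Short names such as C or R remain ambiguous under any rule of this kind (a single letter can appear in unrelated contexts), so the resulting counts should be read as approximate.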
The selected variables (number of questions/topics/points) were usually
analysed in four pairs:
1) number of questions on SO vs. number of topics on HN,
2) number of questions on SO vs. number of points on HN,
3) number of points on SO vs. number of topics on HN,
4) number of points on SO vs. number of points on HN.
I have previously written that all variables should be measured in a
certain time unit, which leads to the next issue: how to aggregate data
from a certain period? I decided to use a sum as the aggregation
function, e.g. the sum of questions which appeared on a certain day. You
could also come up with, for example, an average. However, the problem
with such a metric is the small samples on which it would sometimes be
computed (for example, the mean number of points gathered by questions
on SO from a given day when only one or two posts popped up during those
24 hours), which would make it unrepresentative.
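A minimal sketch of this daily aggregation, using only the Python standard library, is shown below. The records and the helper name `daily_sums` are made up for illustration and do not come from the original data or code.

```python
from collections import defaultdict
from datetime import date


def daily_sums(records):
    """Sum values per day; `records` is an iterable of (date, value) pairs."""
    totals = defaultdict(int)
    for day, value in records:
        totals[day] += value
    # Return a plain dict ordered by date so it reads as a time series.
    return dict(sorted(totals.items()))


# Hypothetical points gathered by individual questions (can be negative).
records = [
    (date(2018, 1, 1), 3),
    (date(2018, 1, 1), -1),
    (date(2018, 1, 2), 5),
]

points_per_day = daily_sums(records)
# Counting questions is the same sum with a value of 1 per record.
questions_per_day = daily_sums((day, 1) for day, _ in records)
```

With a sum, a day with a single question simply contributes a small value, whereas a daily mean computed from that one question would be given the same weight as a mean computed from hundreds of them.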
How to cope with the problem of validating a relationship between two
phenomena? The **first approach** could be an EDA - **Exploratory Data
Analysis**, which basically means producing some plots and trying to
infer something from them. The advantage of this solution is the visual
aspect: you can clearly see the trend (or lack thereof) in the
popularity of a given programming language, and for most people it is
easier to read plots than bare tables. The drawback of this method when
trying to unravel causality is its qualitative character - there is no
statistic/test indicating whether the conclusions you draw from the
plots are correct or not.
The **second approach** is a quantitative one: **Granger causality**.
*Wait a minute,* you may ask, *there’s a specific type of causality?*
Basically, yup. Granger causality, proposed in 1969, determines whether
one time series is helpful in forecasting another time series.
Therefore, this type of causality is called *predictive causality*.
Note that the question *Is one phenomenon a cause of another one?* is
different from what Granger causality measures: here you only take past
values of a given variable and try to use them to forecast the future
values of another phenomenon, just like when building a forecasting
model. That means that **Granger causality is not and does not indicate
causality between two phenomena**. Nonetheless, it may indicate a
relationship, resulting either from a third factor influencing the two
observed ones or from one variable really being an effect of the other.
It is, however, impossible to tell which of these is the case on the
basis of the Granger causality test itself.
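For reference, the idea can be written down in its usual textbook form. The notation below (series $x_t$ and $y_t$, lag order $p$, coefficients $\alpha_i$, $\beta_i$) is introduced here only for illustration and is not taken from the analysis itself. The test compares a model of $y_t$ based on its own past with a model which additionally uses the past of $x_t$:

```latex
% Restricted model: y is explained only by its own past values
y_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i \, y_{t-i} + \varepsilon_t

% Unrestricted model: past values of x are added
y_t = \alpha_0 + \sum_{i=1}^{p} \alpha_i \, y_{t-i}
               + \sum_{i=1}^{p} \beta_i \, x_{t-i} + u_t

% Null hypothesis: x does not Granger-cause y
H_0 : \beta_1 = \beta_2 = \dots = \beta_p = 0
```

If the null hypothesis is rejected (typically with an F-test), the past values of $x_t$ improve the forecast of $y_t$ and $x$ is said to Granger-cause $y$. In this setting $x_t$ could be, for example, the daily number of topics on HN and $y_t$ the daily number of questions on SO, or the other way round.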
### Data
Data for this analysis come from two sources: (quite obviously) Stack
Overflow and Kaggle (surprise\!). [Kaggle](https://www.kaggle.com/) is a
website which organizes competitions for data scientists/analysts whose
goal is to build the best predictive model for a given phenomenon based
on shared data set(s). It also provides some data sets which are not
strictly for competitions but can be used in EDA (Exploratory Data
Analysis). The data utilised in this analysis with regard to Hacker News
come from [the latter](http://kaggle.com/hacker-news/hacker-news).
Variables regarding Stack Overflow come from queries run in [Stack
Exchange Data Explorer](http://data.stackexchange.com/stackoverflow),
which allows anyone interested to write SQL queries against Stack
Overflow as well as other Stack Exchange databases. The data cover the
period **15th September 2008 - 31st December 2017.**
The programming languages or technologies which were examined include:
C, C++, C\#, Cobol, CSS, D3.js, R, Delphi, Fortran, Hadoop, HTML, Java,
Javascript, JQuery, Pascal, Perl, Python, PHP, Ruby, Rust, Scala, Shell,
Spark, SQL, Swift, Tensorflow, VBA. The choice of technologies was
arbitrary.
The data from the portals were assigned specific colours: grey for Stack
Overflow and orange for Hacker News. Those colours are consistent with
the ones used later on so that it is easier to identify the source of
data (Stack Overflow or Hacker News).
## 3\. Exploratory Data Analysis
One of the ideas for examining causality was to check cumulative plots.
Cumulative plots show the aggregated value of a given measure up to a
given date. For example, if we have data in such a form:
| Date | Value |
| ---------- | :---: |
| 2018-01-01 | 1 |
| 2018-01-02 | 2 |
| 2018-01-03 | 3 |
Then the cumulative value would be the sum of all the values up to a
given date:
| Date | Value | Cumulative value |
| ---------- | :---: | ---------------: |
| 2018-01-01 | 1 | 1 |
| 2018-01-02 | 2 | 3 |
| 2018-01-03 | 3 | 6 |
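The same running total can be computed with a few lines of Python. This is just an illustration of the table above using the standard library, not the code behind the plots:

```python
from itertools import accumulate

# Daily values from the table above (2018-01-01 to 2018-01-03).
values = [1, 2, 3]

# accumulate() yields running totals: each element is the sum of all
# values up to and including that date.
cumulative = list(accumulate(values))
```

Here `cumulative` holds the numbers from the "Cumulative value" column.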
### 3.0 C\#
As an example, the plot below shows:
1) the cumulative number of questions asked about C\# on Stack Overflow up to the date on the x axis (grey line),
2) the cumulative number of points gathered by all topics regarding C\# on Hacker News up to the date on the x axis (orange line).
It can be seen that by the end of 2017 the cumulative number of
questions on SO exceeded 50 thousand, while at the same time the number
of points for topics about C\# on HN reached about 30 thousand.
![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_c_sharp_so_usage_cnt_cum_hn_all_match_score_cum.png)
The plot for C\# described above does not seem particularly interesting.
It shows (rather obviously) an upward trend for both variables, although
their dynamics differ. What is more, it would be nice to see
standardized variables, for example rescaled in such a way that they
both start at 0 and end at 1. Thanks to such a transformation it is
possible to find time series which behave similarly over time but differ
in the magnitude by which they increase (or decrease). The plot for the
same phenomena and technology as above but with standardized variables
is presented below:
![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_c_sharp_stand_so_usage_cnt_cum_stand_hn_all_match_score_cum.png)
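One simple way to obtain series which start at 0 and end at 1 is min-max scaling. The sketch below assumes that this is the kind of standardization meant here, which the text does not state explicitly; the function name and the numbers are made up for illustration.

```python
def rescale_to_unit_interval(series):
    """Min-max scale a sequence of numbers to the range [0, 1]."""
    lo, hi = min(series), max(series)
    if hi == lo:
        # A constant series has no spread; map every value to 0.
        return [0.0 for _ in series]
    return [(value - lo) / (hi - lo) for value in series]


# Two hypothetical cumulative series on very different scales.
cumulative_questions = [1, 3, 6]
cumulative_points = [10, 40, 60]

scaled_questions = rescale_to_unit_interval(cumulative_questions)
scaled_points = rescale_to_unit_interval(cumulative_points)
```

For a non-decreasing cumulative series the minimum is the first value and the maximum is the last one, so the rescaled series starts at 0 and ends at 1 and two series of very different magnitude can be compared on one axis.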
Now it can be noticed that the cumulative number of questions on SO and
the cumulative number of points for topics on HN show a strong
resemblance. A similar resemblance can be seen when comparing
standardized plots of the cumulative number of questions vs. the
cumulative number of topics:
![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_c_sharp_so_usage_cnt_cum_hn_all_match_cnt_cum_double.png)
Let’s see some interesting similarities between statistics on SO and HN
for different technologies on standardized plots. We will only look at
the technologies for which I identified some sort of similarity between
data from SO and HN or for which I discovered something interesting.
Additionally, plots on the left will be the ones for standardised
variables while those on the right will show variables without
transformation (standardisation).
### 3.1 C
Similarly to C\#, there is a visible resemblance between the cumulative
number of questions on SO and the cumulative number of points for topics
on HN:
![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_c_so_usage_cnt_cum_hn_all_score_sum_cum_double.png)
as well as between the cumulative number of questions on SO and the
cumulative number of topics on HN:
![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_c_so_usage_cnt_cum_hn_all_cnt_cum_double.png)
### 3.2 C++
Not surprisingly, in the case of C++ the similarities observed for C and
C\# are repeated.
![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_c++_hn_all_match_score_cum_s_hn_all_match_score_cum_hn_all_match_cnt_cum_s_hn_all_match_cnt_cum_d_since2008-09-16.png)
### 3.3 Cobol
Yes, Cobol. I wanted to say (write) *good, old Cobol*, however, I find
such a statement to be a bit of an exaggeration for this, I would say,
*antique* programming language. If you don’t know what Cobol is, its
name is an acronym of *common business-oriented language*, and its
syntax closely resembles the English language. The aim of such a design
was to make it readable for both programmers and non-technical staff
like managers. It was introduced in 1959(\!) and was/is used in a
variety of environments, including banking and insurance.
The plot below shows a high resemblance between the cumulative number of
questions on SO and the cumulative number of topics regarding this
programming language on HN.
![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_cobol_so_usage_cnt_cum_hn_all_cnt_cum_double.png)
It is worth noting that in a span of about 9 years fewer than 300
questions appeared on SO, which shows the low popularity, or even lack
of popularity, of this technology nowadays.
### 3.4 CSS
A similar resemblance is observed between the cumulative number of
questions on SO and the cumulative number of points gathered by the
topics on HN:
![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_css_so_usage_cnt_cum_hn_all_score_cum_double.png)
and between the cumulative number of questions on SO and the cumulative
number of topics on HN:
![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_css_so_usage_cnt_cum_hn_all_cnt_cum_double.png)
### 3.5 D3.js
In the case of the Javascript visualization library D3.js a resemblance
is observed between the cumulative number of points obtained by
questions on SO and the cumulative number of points gathered by topics
on HN:
![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_d3js_so_usage_cnt_cum_hn_all_score_cum_double.png)
### 3.6 Delphi
For Delphi the cumulative number of questions on SO seems to follow the
same trend as the cumulative number of topics on Hacker News:
![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_delphi_so_usage_cnt_cum_hn_all_cnt_cum_double.png)
### 3.7 Fortran
When it comes to Fortran, the similarity is observed between the
cumulative number of questions on SO and the cumulative number of topics
on HN.
![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_fortran_so_usage_cnt_cum_hn_all_cnt_cum_double.png)
Similarly to Cobol, this programming language gathered only 700
questions in over 9 years, which indicates its unpopularity.
### 3.8 Hadoop
In the case of Hadoop, the cumulative number of questions on SO seems to
be similar to the cumulative number of points on HN. What’s interesting
here is the change of dynamic in 2013: from the middle of that year the
number of questions on SO grows faster than the number of points on HN.
![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_hadoop_so_score_sum_cum_hn_all_score_sum_cum_double.png)
### 3.9 HTML
In the case of HTML there was no resemblance between the variables.
Nevertheless, an interesting fact is that from 2014 the number of points
for questions on SO stabilizes and later slightly decreases, by about
5%, which is shown on the plot below:
![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_html_so_score_sum_cum_hn_all_score_sum_cum_double.png)
Such a situation occurred due to a greater number of downvotes than
upvotes. This may be the result of a high number of duplicates since
2014 or of questions which were not formulated clearly or were not
reproducible (and were therefore downvoted).
### 3.10 Java
In the case of Java the resemblance is visible between the cumulative
number of questions on SO and the cumulative number of topics on HN:
![](https://github.com/dgwozdz/HN_SO_analysis/blob/master/readme_vis/plots/20180602_java_so_cnt_sum_cum_hn_all_cnt_sum_cum_double.png)
As in the case of HTML, here the cumulative number of points also
levels off in 2014 to fall slightly by the end of 2015. The significan ... ...