[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/philippbayer/Goodreads_visualization/master?filepath=README.ipynb)
# Goodreads visualization
A Jupyter notebook for playing around with Goodreads data: making some seaborn visualizations and learning more about scikit-learn. It's my own playground!
You can use it with your own data - go [here](https://www.goodreads.com/review/import) and press "Export your library" to get your own csv.
The text you're reading is generated from a Jupyter notebook by the Makefile. If you want to run it yourself, clone the repository, then run

    jupyter notebook your_file.ipynb

to get the interactive version. In the notebook, replace the path to my Goodreads export file with the path to yours, then click on Cell -> Run All.
**WARNING**
It seems that there's currently a bug on Goodreads' end with the export of data, as many recently 'read' books have a read-date which is shown on the web page but doesn't show up in the CSV.
## Dependencies
* Python (3! rpy2 doesn't work under Python2 any more)
* Jupyter
* R (for rpy2)
### Python packages
* seaborn
* pandas
* wordcloud
* nltk
* networkx
* pymarkovchain
* scikit-learn
* distance
* image (PIL inside python for some weird reason)
* gender_guesser
* rpy2
To install all:

    pip install seaborn pandas wordcloud nltk networkx pymarkovchain image scikit-learn distance gender_guesser rpy2

Under Windows with Anaconda, install rpy2 with conda instead of pip:

    conda install rpy2
## Licenses
License for reviews: CC-BY-SA 4.0
Code: MIT
OK, let's start!
## Setting up the notebook
```python
%pylab inline
# for most plots
import numpy as np
import pandas as pd
import seaborn as sns
from collections import defaultdict, Counter, OrderedDict
# for stats
import scipy.stats
# for time-related plots
import datetime
import calendar
# for word cloud
import re
import string
from nltk.corpus import stopwords
from wordcloud import WordCloud
# for Markov chain
from pymarkovchain import MarkovChain
import pickle
import networkx as nx
# for shelf clustering
import distance
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN
sns.set_palette("coolwarm")
# for plotting images
from IPython.display import Image
import gender_guesser.detector as gender
# for R
import pandas
from rpy2 import robjects
# conda install -c r rpy2 on Windows
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]
```
    Populating the interactive namespace from numpy and matplotlib
## Loading the data
```python
df = pd.read_csv('./goodreads_library_export.csv')
# keep only books that have a rating (unrated books have a rating of 0, we don't need that)
cleaned_df = df[df["My Rating"] != 0]
# get rid of noise in 2012
cleaned_df = cleaned_df[(cleaned_df['Date Added'] > '2013-01-01')]
```
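The `Date Added` filter above compares plain strings. Here is a minimal sketch of the same filter on parsed dates, assuming the export stores dates as `YYYY/MM/DD` (that format is an assumption, check your own CSV). The tiny DataFrame is made-up illustration data, not my library.

```python
import pandas as pd

# Made-up stand-in for the Goodreads export (column names as used in this notebook,
# date format assumed to be YYYY/MM/DD)
sample = pd.DataFrame({
    "Title": ["A", "B", "C", "D"],
    "My Rating": [4, 0, 5, 3],
    "Date Added": ["2012/06/01", "2014/02/10", "2013/03/05", "2015/11/30"],
})

# Parse the strings into real timestamps; unparseable values become NaT
sample["Date Added"] = pd.to_datetime(sample["Date Added"], errors="coerce")

# Keep rated books that were added after 1 January 2013
mask = (sample["My Rating"] != 0) & (sample["Date Added"] > pd.Timestamp("2013-01-01"))
sample_cleaned = sample[mask].copy()
print(sample_cleaned["Title"].tolist())
```

Taking a `.copy()` of the filtered frame also avoids pandas warnings when new columns are added to it later.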
# Score distribution
With a score scale of 1-5, you'd expect that the average score is ~~2.5~~ 3 (since 0 is not counted) after a few hundred books (in other words, is it a normal distribution?)
```python
g = sns.distplot(cleaned_df["My Rating"], kde=False)
"Average: %.2f"%cleaned_df["My Rating"].mean(), "Median: %s"%cleaned_df["My Rating"].median()
```
    ('Average: 3.54', 'Median: 4.0')
![png](README_files/README_5_1.png)
That doesn't look normally distributed to me - let's ask Shapiro-Wilk (null hypothesis: data is drawn from normal distribution):
```python
W, p_value = scipy.stats.shapiro(cleaned_df["My Rating"])
if p_value < 0.05:
    print("Rejecting null hypothesis - data does not come from a normal distribution (p=%s)"%p_value)
else:
    print("Cannot reject null hypothesis (p=%s)"%p_value)
```
    Rejecting null hypothesis - data does not come from a normal distribution (p=8.048559751530179e-22)
In my case, the data is not normally distributed (in other words, the book scores are not spread symmetrically around the middle). If you think about it, this makes sense: most readers don't pick books at random. I avoid books I believe I'd dislike and choose books I expect to like, so I rate the books I read higher than average. Therefore, my curve of scores is slanted towards the right (the high end of the scale).
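To put a number on that asymmetry, here is a small sketch using `scipy.stats.skew`. The ratings are made up for illustration (most of the mass at 4 and 5, similar in spirit to my histogram), so the printed values say nothing about my actual library.

```python
import numpy as np
import scipy.stats

# Made-up ratings with most of the mass at 4 and 5 (illustration only)
ratings = np.array([1] * 5 + [2] * 10 + [3] * 25 + [4] * 40 + [5] * 20)

# Negative skewness means the long tail points towards the low scores,
# i.e. the bulk of the ratings sits at the high end of the scale
skewness = scipy.stats.skew(ratings)
print("mean=%.2f median=%.1f skew=%.2f" % (ratings.mean(), np.median(ratings), skewness))
```

A mean below the median together with a negative skewness is what you'd expect when low scores are rare but do occur.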
## plot Pages vs Ratings
Do I give longer books better scores? A minor tendency but nothing special (it's confounded by having just 5 possible numbers in ratings)
```python
# pass the columns as keywords - newer seaborn versions no longer accept them positionally
g = sns.jointplot(x="Number of Pages", y="My Rating", data=cleaned_df, kind="reg", height=7, ylim=[0.5,5.5])
g.annotate(scipy.stats.pearsonr)
```
    C:\Users\00089503\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\axisgrid.py:1847: UserWarning: JointGrid annotation is deprecated and will be removed in a future release.
      warnings.warn(UserWarning(msg))
![png](README_files/README_10_2.png)
I seem to mostly read books at around 200 to 300 pages so it's hard to tell whether I give longer books better ratings.
It's a nice example that, in linear regression, a p-value as tiny as this one doesn't mean much on its own: the r-value is still bad.
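The gap between a tiny p-value and a weak correlation is easy to reproduce. This sketch uses synthetic data (not my ratings): a weak linear signal buried in noise, with enough points that the p-value becomes very small while r stays low.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)

# Synthetic data: a weak linear signal buried in noise (illustration only)
n = 10000
x = rng.normal(size=n)
y = 0.1 * x + rng.normal(size=n)

# pearsonr returns the correlation coefficient and the two-sided p-value
r, p = scipy.stats.pearsonr(x, y)
print("r=%.3f, r^2=%.3f, p=%.2e" % (r, r ** 2, p))
```

The p-value only says that the correlation is unlikely to be exactly zero; r squared says how little of the variance the line actually explains.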
***
## plot Ratings vs Bookshelves
Let's parse ratings for books and make a violin plot for the 7 categories with the most rated books!
```python
CATEGORIES = 7 # number of most crowded categories to plot
# we have to fiddle a bit - we have to count the ratings by category,
# since each book can have several comma-delimited categories
# TODO: find a pandas-like way to do this
shelves_ratings = defaultdict(list) # key: shelf-name, value: list of ratings
shelves_counter = Counter() # counts how many books on each shelf
shelves_to_names = defaultdict(list) # key: shelf-name, value: list of book names
for index, row in cleaned_df.iterrows():
    my_rating = row["My Rating"]
    if my_rating == 0:
        continue
    if pd.isnull(row["Bookshelves"]):
        continue
    shelves = row["Bookshelves"].split(",")
    for s in shelves:
        # empty shelf?
        if not s: continue
        s = s.strip() # I had "non-fiction" and " non-fiction"
        shelves_ratings[s].append(my_rating)
        shelves_counter[s] += 1 # one more book on this shelf
        shelves_to_names[s].append(row.Title)
names = []
ratings = []
for name, _ in shelves_counter.most_common(CATEGORIES):
    for number in shelves_ratings[name]:
        names.append(name)
        ratings.append(number)
full_table = pd.DataFrame({"Category":names, "Rating":ratings})
# if we don't use scale=count here then each violin has the same area
sns.violinplot(x = "Category", y = "Rating", data=full_table, scale='count')
```
![png](README_files/README_12_1.png)
There is some *bad* SF out there.
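The shelf-counting loop above carries a TODO about a more pandas-like way. One possible sketch uses `str.split` plus `DataFrame.explode` (available in pandas 0.25 and later). The three books below are made-up rows that mimic the export's comma-delimited `Bookshelves` column; the `Category` column name is my own choice.

```python
import pandas as pd

# Made-up rows mimicking the export's comma-delimited "Bookshelves" column
books = pd.DataFrame({
    "Title": ["Dune", "Walden", "Solaris"],
    "My Rating": [5, 3, 4],
    "Bookshelves": ["sf, fiction", "essays, non-fiction", " sf"],
})

# One row per (book, shelf): split on commas, explode the lists, strip whitespace
per_shelf = (
    books.dropna(subset=["Bookshelves"])
         .assign(Category=lambda d: d["Bookshelves"].str.split(","))
         .explode("Category")
)
per_shelf["Category"] = per_shelf["Category"].str.strip()
per_shelf = per_shelf[per_shelf["Category"] != ""]

# Number of books per shelf, most crowded first
shelf_counts = per_shelf["Category"].value_counts()
print(shelf_counts)
```

The resulting long table already has one rating per shelf membership, so it could be filtered to the most common shelves and handed to `sns.violinplot` directly.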
At this point I wonder - since we can assign multiple 'shelves' (tags) to each book, do I have some tags that appear more often together than not? Let's use R!
```python
%load_ext rpy2.ipython
```
```python
all_shelves = shelves_counter.keys()
names_dict = {} # key: shelf name, value: robjects.StrVector of names
for c in all_shelves:
    names_dict[c] = robjects.StrVector(shelves_to_names[c])
names_dict = robjects.ListVector(names_dict)
```
```r
%%R -i names_dict -r 150 -w 900 -h 600
library(UpSetR)
names_dict <- fromList(names_dict)
# by default, only 5 sets are considered, so change nsets
upset(names_dict, nsets = 9)
```
![png](README_files/README_16_0.png)
Most shelves are 'alone', but 'essays + non-fiction', 'sci-fi + sf' (should clean that up...), 'biography + non-fiction' show the biggest overlap.
I may have messed up the categories, so let's cluster them! Typos should cluster together.
```python
# get the Levenshtein distance between all shelf titles, normalise the distance by string length
X = np.array([[float(distance.levenshtein(shelf_1, shelf_2)) / max(len(shelf_1), len(shelf_2))
               for shelf_1 in all_shelves] for shelf_2 in all_shelves])
# scale for clustering
X = StandardScaler().fit_transform(X)
# after careful fiddling I'm settling on eps=10
clusters = DBSCAN(eps=10, min_samples=1).fit_predict(X)
print('DBSCAN made %s clusters for %s shelves/tags.'%(len(set(clusters)), len(all_shelves)))
cluster_dict = defaultdict(list)
assert len(clusters) == len(all_shelves)
for cluster_label, element in zip(clusters, all_shelves):
    cluster_dict[cluster_label].append(element)
print('Clusters with more than one member:')
for k in sorted(cluster_dict):
    if len(cluster_dict[k]) > 1:
        print(k, cluster_dict[k])
```
    DBSCAN made 166 clusters for 184 shelves/tags.
    Clusters with more than one member:
    1 ['fiction', 'action']
    2 ['russia', 'russian']
    12 ['latin-america', 'native-american']
    24 ['ww1', 'ww2']
    32 ['humble-bundle2', 'humble-bundle-jpsf']
    47 ['essays', 'essay']
    49 ['on-living', 'on-writing', 'on-thinking']
    50 ['history-of-biology', 'history-of-maths', 'history-of-cs', 'history-of-philosophy']
    53 ['greek', 'greece']
    66 ['iceland', 'ireland']
    88 ['mythology', 'psychology', 'sociology', 'theology']
    116 ['philosophy', 'pop-philosophy']
    126 ['letters', 'lectures']
Some clusters are problematic due to too-short label names (arab/iraq), some other clusters are good and show me that I made some mistakes in labeling! French and France should be together, Greece and Greek too. *Neat!*
(Without normalising the distance by string length clusters like horror/body-horror don't appear.)
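To make the effect of the normalisation concrete, here is a self-contained sketch of the edit distance and its length-normalised version. It is an independent pure-Python implementation written for illustration, not the code of the `distance` package, and the function names are my own.

```python
def levenshtein(a, b):
    """Edit distance between two strings (classic dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalised_distance(a, b):
    """Edit distance divided by the longer string's length, so it lies in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


for pair in [("essay", "essays"), ("horror", "body-horror"), ("arab", "iraq")]:
    print(pair, levenshtein(*pair), round(normalised_distance(*pair), 2))
```

Comparing the raw and normalised values for the three pairs shows both effects mentioned above: a long shared suffix like horror/body-horror looks much closer after normalisation, while two edits already make up half of a four-letter label like arab/iraq.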
## plotHistogramDistanceRead.py
Let's check the "dates read" for each book and plot the distance in days between consecutive books - this shows how quickly you hop from book to book.
I didn't use Goodreads much in 2012, so let's see what it looks like without 2012:
```python
# first, transform to datetype and get rid of all invalid dates
#dates = pd.to_datetime(cleaned_df["Date Read"])
dates = pd.to_datetime(cleaned_df["Date Added"])
dates = dates.dropna()
sorted_dates = sorted(dates)
last_date = None
differences = []
all_days = []
all_days_without_2012 = [] # not much goodreads usage in 2012 - remove that year
for date in sorted_dates:
    if not last_date:
        last_date = date
        if date.year != 2012:
            last_date_not_2012 = date
    difference = date - last_date
    days = difference.days
    all_days.append(days)
    if date.year != 2012:
        all_days_without_2012.append(days)
    last_date = date
sns.distplot(all_days_without_2012, axlabel="Distance in days between books read")
pylab.show()
```
![png](README_files/README_21_0.png)
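The day differences can also be computed without an explicit loop. A minimal sketch with `Series.diff`, using made-up dates in place of the parsed date column:

```python
import pandas as pd

# Made-up dates standing in for the parsed date column
example_dates = pd.to_datetime(pd.Series(
    ["2013-01-05", "2013-01-01", "2013-02-10", "2013-01-20"]
))

# Sort, then take the gap to the previous date; the first entry has no predecessor
gaps_in_days = example_dates.sort_values().diff().dt.days.dropna().astype(int)
print(gaps_in_days.tolist())
```

Unlike the loop above, this version does not record a zero for the very first book, since there is no earlier date to compare it with.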
***
## plot Heatmap of dates read
Parses the "dates read" for each book read, bins them by month, and makes a heatmap to show in which months I read more than in others. Also makes a lineplot for books read, split up by year.
NOTE: For about a year now there has been a very strange bug in Goodreads: the exported CSV does not correctly track the date read.
```python
# we need a dataframe in this format:
# year months books_read
# I am sure there's some magic pandas function for this
read_dict = defaultdict(int) # key: (year, month), value: count of books read
for date in sorted_dates:
    this_year = date.year
    this_month = date.month
    read_dict[ (this_year, this_month) ] += 1
first_date = sorted_dates[0]
first_year = first_date.year
first_month = first_date.month
todays_date = datetime.datetime.today()
todays_year = todays_date.year
todays_month = todays_date.month
all_years = []
all_months = []
all_counts = []
for year in range(first_year, todays_year+1):
    for month in range(1, 13):
        if (year == todays_year) and month > todays_month:
            # don't count future months
            break
        this_count = read_dict[ (year, month) ]
        all_years.append(year)
        all_months.append(month)
        all_counts.append(this_count)
# now get it in the format heatmap() wants
df = pd.DataFrame( { "month":all_months, "year":all_years, "books_read":all_counts } )
# keyword arguments work across pandas versions (positional ones were removed in pandas 2.0)
dfp = df.pivot(index="month", columns="year", values="books_read")
fig, ax = plt.subplots(figsize=(10,10))
# now make the heatmap
ax = sns.heatmap(dfp, annot=True, ax=ax, square= True)
```
![png](README_files/README_23_0.png)
What happened in May 2014?
Update in 2018 - currently the 'date_read' column doesn't accurately track which books were actually read, this is a bug on Goodreads' end, see for example https://help.goodreads.com/s/question/0D51H00004ADr7o/i-have-exported-my-library-and-some-books-do-not-have-any-information-listed-for-date-read
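The heatmap code above wonders about a "magic pandas function" for the month-by-year counts. One candidate is `pd.crosstab`; here is a sketch on made-up dates (not my data):

```python
import pandas as pd

# Made-up dates standing in for sorted_dates
example_dates = pd.to_datetime(pd.Series(
    ["2013-01-05", "2013-01-20", "2013-03-02", "2014-01-15"]
))

# Cross-tabulate month against year; combinations without books become 0
counts = pd.crosstab(example_dates.dt.month, example_dates.dt.year)

# Make sure all twelve months appear, even those without any books
counts = counts.reindex(range(1, 13), fill_value=0)
counts.index.name = "month"
counts.columns.name = "year"
print(counts)
```

This differs from the loop in two ways: a year without any books would be missing as a column, and future months of the current year show up as 0 instead of being left out.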
***
## Plot books read by year
```python
g = sns.FacetGrid(df, col="year", sharey=True, sharex=True, col_wrap=4)
g.map(plt.scatter, "month", "books_read")
g.set_ylabels("Books read")
g.set_xlabels("Month")
pylab.xlim(1, 12)
pylab.show()
```
![png](README_files/README_25_0.png)
It's nice how reading behaviour (Goodreads usage) connects over the months - it slowly rises in 2013, stays constant in 2014/2015, and now goes down again. You can see when my first son was born!
(Solution: 2016-8-25)
(all other >2018 books are still missing their date_read dates...)
***
## Guessing authors' genders
Let's check whether I read mostly male or female authors using the gender-guesser package!
```python
first_names = cleaned_df['Author'].str.split(' ',expand=True)[0]
d = gender.Detector(case_sensitive=False)
genders = [d.get_gender(name) for name in first_names]
print(list(zip(genders[:5], first_names[:5])))
# let's also add those few 'mostly_female' and 'mostly_male' into the main group
genders = pd.Series([x.replace('mostly_female','female').replace('mostly_male','male') for x in genders])
```
    [('male', 'Don'), ('male', 'Daniil'), ('male', 'William'), ('unknown', 'E.T.A.'), ('male', 'John')]
```python
gender_ratios = genders.value_counts()
print(gender_ratios)
_ = gender_ratios.plot(kind='bar')
```
    male       423
    unknown     67
    female      56
    andy         3
    dtype: int64
![png](README_files/README_28_1.png)
Now THAT'S gender bias. Do I rate the genders differently?
```python
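# assign by position: genders has a fresh 0..n-1 index that doesn't match cleaned_df's filtered index
cleaned_df['Gender'] = genders.values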
male_scores = cleaned_df[cleaned_df['Gender'] == 'male']['My Rating'].values
female_scores = cleaned_df[cleaned_df['Gender'] == 'female']['My Rating'].values
_ = plt.hist([male_scores, female_scores], color=['r','b'], alpha=0.5)
```
![png](README_files/README_30_0.png)
It's hard to tell any difference since there are so few women authors here - let's split them up into different plots
```python
fig, axes = plt.subplots(2,1)
axes[0].hist(male_scores, color='r', alpha=0.5, bins=10)
axes[0].set_xlabel('Scores')
# label the y-axis of each histogram
axes[0].set_ylabel('male scores')
axes[1].hist(female_scores, color='b', alpha=0.5, bins=10)
axes[1].set_ylabel('female scores')
fig.tight_layout()
```
![png](README_files/README_32_0.png)
Are these two samples from the same distribution? Hard to tell since their size is so different, but let's ask Kolmogorov-Smirnov (null hypothesis: they are from the same distribution)
```python
scipy.stats.ks_2samp(male_scores, female_scores)
```
    Ks_2sampResult(statistic=0.22018779342723005, pvalue=0.13257156821934568)
We cannot reject the null hypothesis because the p-value (0.13) is above 0.05 (but again, there are so few female scores...).
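How much the small female sample matters can be sketched with made-up rating counts (illustration only, not my data): both "female" samples below have exactly the same proportions, the second one is simply ten times larger.

```python
import scipy.stats

# Made-up rating counts (illustration only): both "female" samples have exactly
# the same proportions, the second one is just ten times larger
male = [1] * 20 + [2] * 40 + [3] * 100 + [4] * 140 + [5] * 100
female_small = [3] * 2 + [4] * 3 + [5] * 5
female_large = female_small * 10

small = scipy.stats.ks_2samp(male, female_small)
large = scipy.stats.ks_2samp(male, female_large)

# The KS statistic (largest gap between the two cumulative distributions) is the
# same in both cases, but the p-value shrinks as the sample grows
print("n=%d: D=%.2f, p=%.3g" % (len(female_small), small.statistic, small.pvalue))
print("n=%d: D=%.2f, p=%.3g" % (len(female_large), large.statistic, large.pvalue))
```

So a non-significant result with a small sample is weak evidence that the two groups are rated the same. Keep in mind that the Kolmogorov-Smirnov test is designed for continuous data, so with five discrete rating values its p-values are only a rough guide.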
***
## Compare with Goodreads 10k
A helpful soul has uploaded ratings and stats for the 10,000 books with most ratings on Goodreads (https://github.com/zygmuntz/goodbooks-10k). Let's compare those with my ratings!
(You may have to run

    git submodule update

to get the 10k submodule)
```python
other = pd.read_csv('./goodbooks-10k/books.csv')
print(other.columns)
other.head(3)
```
    Index(['book_id', 'goodreads_book_id', 'best_book_id', 'work_id',
           'books_count', 'isbn', 'isbn13', 'authors', 'original_publication_year',
           'original_title', 'title', 'language_code', 'average_rating',
           'ratings_count', 'work_ratings_count', 'work_text_reviews_count',
           'ratings_1', 'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5',
           'image_url', 'small_image_url'],
          dtype='object')

| | book_id | goodreads_book_id | best_book_id | work_id | books_count | isbn | isbn13 | authors | original_publication_year | original_title | ... | ratings_count | work_ratings_count | work_text_reviews_count | ratings_1 | ratings_2 | ratings_3 | ratings_4 | ratings_5 | image_url | small_image_url |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2767052 | 2767052 | 2792775 | 272 | 439023483 | 9.780439e+12 | Suzanne Collins | 2008.0 | The Hunger Games | ... | 4780653 | 4942365 | 155254 | 66715 | 127936 | 560092 | 1481305 | 2706317 | https://images.gr-assets.com/books/1447303603m... | https://images.gr-assets.com/books/1447303603s... |
| 1 | 2 | 3 | 3 | 4640799 | 491 | 439554934 | 9.780440e+12 | J.K. Rowling, Mary GrandPre | 1997.0 | Harry Potter and the Philosopher's Stone | ... | 4602479 | 4800065 | 75867 | 75504 | 101676 | 455024 | 1156318 | 3011543 | https://images.gr-assets.com/books/1474154022m... | https://images.gr-assets.com/books/1474154022s... |
| 2 | 3 | 41865 | 41865 | 3212258 | 226 | 316015849 | 9.780316e+12 | Stephenie Meyer | 2005.0 | Twilight | ... | 3866839 | 3916824 | 95009 | 456191 | 436802 | 793319 | 875073 | 1355439 | https://images.gr-assets.com/books/1361039443m... | https://images.gr-assets.com/books/1361039443s... |

3 rows × 23 columns
What's the gender ratio here?
```python
other_first_names = other.authors.str.split(' ',expand=True)[0]
for index, x in enumerate(other_first_names):
    if x == 'J.R.R.':
        other_first_names[index] = 'John'
    elif x == 'J.K.':
        other_first_names[index] = 'Joanne'
    elif x == 'F.':
... ...