[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/philippbayer/Goodreads_visualization/master?filepath=README.ipynb)

# Goodreads visualization

A Jupyter notebook to play around with Goodreads data and make some seaborn visualizations, learn more about scikit-learn - my own playground!

You can use it with your own data - go [here](https://www.goodreads.com/review/import) and press "Export your library" to get your own csv.

The text you're reading is generated from a Jupyter notebook by the Makefile. If you want to run it yourself, clone the repository, then run `jupyter notebook your_file.ipynb` to get the interactive version. In there, replace the path to my Goodreads export file with yours in the ipynb file, and then click on Cell -> Run All.

**WARNING** It seems that there's currently a bug on Goodreads' end with the export of data: many recently 'read' books have a read-date which is shown on the web page but doesn't show up in the CSV.

## Dependencies

* Python (3! rpy2 doesn't work under Python 2 any more)
* Jupyter
* R (for rpy2)

### Python packages

* seaborn
* pandas
* wordcloud
* nltk
* networkx
* pymarkovchain
* scikit-learn
* distance
* image (PIL inside python for some weird reason)
* gender_guesser
* rpy2

To install all:

    pip install seaborn wordcloud nltk networkx pymarkovchain image scikit-learn distance gender_guesser rpy2

Under Windows with Anaconda you need to run `conda install rpy2` instead of using pip to install rpy2.

## Licenses

License for reviews: CC-BY-SA 4.0

Code: MIT

OK, let's start!

## Setting up the notebook

```python
%pylab inline

# for most plots
import numpy as np
import pandas as pd
import seaborn as sns
from collections import defaultdict, Counter, OrderedDict

# for stats
import scipy.stats

# for time-related plots
import datetime
import calendar

# for word cloud
import re
import string
from nltk.corpus import stopwords
from wordcloud import WordCloud

# for Markov chain
from pymarkovchain import MarkovChain
import pickle
import networkx as nx

# for shelf clustering
import distance
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN

sns.set_palette("coolwarm")

# for plotting images
from IPython.display import Image

import gender_guesser.detector as gender

# for R
import pandas
from rpy2 import robjects  # conda install -c r rpy2 on Windows

import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = [10, 5]
```

    Populating the interactive namespace from numpy and matplotlib

## Loading the data

```python
df = pd.read_csv('./goodreads_library_export.csv')

# keep only books that have a rating (unrated books have a rating of 0, we don't need that)
cleaned_df = df[df["My Rating"] != 0]
# get rid of noise in 2012
cleaned_df = cleaned_df[(cleaned_df['Date Added'] > '2013-01-01')]
```

# Score distribution

With a score scale of 1-5, you'd expect that the average score is ~~2.5~~ 3 (since 0 is not counted) after a few hundred books - in other words, is it a normal distribution?
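(Side note: `sns.distplot`, used in the next cell, has been deprecated since seaborn 0.11. If that cell errors on a newer install, a roughly equivalent sketch with `histplot` would be:)

```python
# Sketch only: seaborn >= 0.11 equivalent of the distplot(kde=False) call below
g = sns.histplot(cleaned_df["My Rating"], discrete=True)
```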
```python
g = sns.distplot(cleaned_df["My Rating"], kde=False)
"Average: %.2f"%cleaned_df["My Rating"].mean(), "Median: %s"%cleaned_df["My Rating"].median()
```

    ('Average: 3.54', 'Median: 4.0')

![png](README_files/README_5_1.png)

That doesn't look normally distributed to me - let's ask Shapiro-Wilk (null hypothesis: the data is drawn from a normal distribution):

```python
W, p_value = scipy.stats.shapiro(cleaned_df["My Rating"])
if p_value < 0.05:
    print("Rejecting null hypothesis - data does not come from a normal distribution (p=%s)"%p_value)
else:
    print("Cannot reject null hypothesis (p=%s)"%p_value)
```

    Rejecting null hypothesis - data does not come from a normal distribution (p=8.048559751530179e-22)

In my case, the data is not normally distributed (in other words, the book scores are not evenly distributed around the middle). If you think about it, this makes sense: most readers don't read perfectly randomly. I avoid books I believe I'd dislike and choose books that I prefer; I rate those books higher than average, so my curve of scores is slanted towards the right.

## plot Pages vs Ratings

Do I give longer books better scores? A minor tendency, but nothing special (it's confounded by there being only 5 possible rating values).

```python
g = sns.jointplot("Number of Pages", "My Rating", data=cleaned_df, kind="reg", height=7, ylim=[0.5, 5.5])
g.annotate(scipy.stats.pearsonr)
```

    C:\Users\00089503\AppData\Local\Continuum\anaconda3\lib\site-packages\seaborn\axisgrid.py:1847: UserWarning: JointGrid annotation is deprecated and will be removed in a future release.
      warnings.warn(UserWarning(msg))

![png](README_files/README_10_2.png)

I seem to mostly read books at around 200 to 300 pages, so it's hard to tell whether I give longer books better ratings. It's a nice example that, in linear regression, a p-value as tiny as this one doesn't mean much on its own - the r-value is still bad.

***

## plot Ratings vs Bookshelves

Let's parse the ratings for the books and make a violin plot for the 7 categories with the most rated books!

```python
CATEGORIES = 7 # number of most crowded categories to plot

# we have to fiddle a bit - we have to count the ratings by category,
# since each book can have several comma-delimited categories
# TODO: find a pandas-like way to do this

shelves_ratings = defaultdict(list) # key: shelf-name, value: list of ratings
shelves_counter = Counter() # counts how many books on each shelf
shelves_to_names = defaultdict(list) # key: shelf-name, value: list of book names

for index, row in cleaned_df.iterrows():
    my_rating = row["My Rating"]
    if my_rating == 0:
        continue
    if pd.isnull(row["Bookshelves"]):
        continue
    shelves = row["Bookshelves"].split(",")
    for s in shelves:
        # empty shelf?
        if not s:
            continue
        s = s.strip() # I had "non-fiction" and " non-fiction"
        shelves_ratings[s].append(my_rating)
        shelves_counter[s] += 1
        shelves_to_names[s].append(row.Title)

names = []
ratings = []
for name, _ in shelves_counter.most_common(CATEGORIES):
    for number in shelves_ratings[name]:
        names.append(name)
        ratings.append(number)

full_table = pd.DataFrame({"Category": names, "Rating": ratings})

# if we don't use scale=count here then each violin has the same area
sns.violinplot(x="Category", y="Rating", data=full_table, scale='count')
```

![png](README_files/README_12_1.png)

There is some *bad* SF out there. At this point I wonder - since we can assign multiple 'shelves' (tags) to each book, do I have some tags that appear more often together than not? Let's use R!
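(An aside before switching to R: the TODO in the cell above asks for a pandas-like way to do the per-shelf counting. A hedged sketch using `str.split` plus `explode` (pandas >= 0.25) could look like this - it is not the notebook's original approach, just an alternative under the same `cleaned_df` and `CATEGORIES` assumptions.)

```python
# Sketch only: pandas-native counting of ratings per shelf (assumes cleaned_df, CATEGORIES from above)
shelf_df = (cleaned_df[["Title", "My Rating", "Bookshelves"]]
            .dropna(subset=["Bookshelves"])
            .assign(Bookshelves=lambda d: d["Bookshelves"].str.split(","))
            .explode("Bookshelves"))
shelf_df["Bookshelves"] = shelf_df["Bookshelves"].str.strip()

# keep only the most crowded shelves and reuse the same plotting call
top_shelves = shelf_df["Bookshelves"].value_counts().head(CATEGORIES).index
full_table = (shelf_df[shelf_df["Bookshelves"].isin(top_shelves)]
              .rename(columns={"Bookshelves": "Category", "My Rating": "Rating"}))
sns.violinplot(x="Category", y="Rating", data=full_table, scale="count")
```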
```python
%load_ext rpy2.ipython
```

```python
all_shelves = shelves_counter.keys()
names_dict = {} # key: shelf name, value: robjects.StrVector of names
for c in all_shelves:
    names_dict[c] = robjects.StrVector(shelves_to_names[c])
names_dict = robjects.ListVector(names_dict)
```

```r
%%R -i names_dict -r 150 -w 900 -h 600
library(UpSetR)
names_dict <- fromList(names_dict)
# by default, only 5 sets are considered, so change nsets
upset(names_dict, nsets = 9)
```

![png](README_files/README_16_0.png)

Most shelves are 'alone', but 'essays + non-fiction', 'sci-fi + sf' (should clean that up...), and 'biography + non-fiction' show the biggest overlap.

I may have messed up the categories, so let's cluster them! Typos should cluster together.

```python
# get the Levenshtein distance between all shelf titles, normalise the distance by string length
X = np.array([[float(distance.levenshtein(shelf_1, shelf_2))/max(len(shelf_1), len(shelf_2)) \
               for shelf_1 in all_shelves] for shelf_2 in all_shelves])
# scale for clustering
X = StandardScaler().fit_transform(X)

# after careful fiddling I'm settling on eps=10
clusters = DBSCAN(eps=10, min_samples=1).fit_predict(X)

print('DBSCAN made %s clusters for %s shelves/tags.'%(len(set(clusters)), len(all_shelves)))

cluster_dict = defaultdict(list)
assert len(clusters) == len(all_shelves)
for cluster_label, element in zip(clusters, all_shelves):
    cluster_dict[cluster_label].append(element)

print('Clusters with more than one member:')
for k in sorted(cluster_dict):
    if len(cluster_dict[k]) > 1:
        print(k, cluster_dict[k])
```

    DBSCAN made 166 clusters for 184 shelves/tags.
    Clusters with more than one member:
    1 ['fiction', 'action']
    2 ['russia', 'russian']
    12 ['latin-america', 'native-american']
    24 ['ww1', 'ww2']
    32 ['humble-bundle2', 'humble-bundle-jpsf']
    47 ['essays', 'essay']
    49 ['on-living', 'on-writing', 'on-thinking']
    50 ['history-of-biology', 'history-of-maths', 'history-of-cs', 'history-of-philosophy']
    53 ['greek', 'greece']
    66 ['iceland', 'ireland']
    88 ['mythology', 'psychology', 'sociology', 'theology']
    116 ['philosophy', 'pop-philosophy']
    126 ['letters', 'lectures']

Some clusters are problematic due to too-short label names (arab/iraq), but other clusters are good and show me that I made some mistakes in labeling! French and France should be together, Greece and Greek too. *Neat!* (Without normalising the distance by string length, clusters like horror/body-horror don't appear.)

## plotHistogramDistanceRead.py

Let's check the "dates read" for each book read and plot the distance in days between books read - this shows you how quickly you hop from book to book.
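(For reference, a compact sketch of the same gap calculation with a pandas `diff()`, without the 2012 filtering done in the next cell - assuming the same `cleaned_df` as above:)

```python
# Sketch only: day gaps between consecutive dates via pandas, no 2012 filtering
gaps = (pd.to_datetime(cleaned_df["Date Added"])
          .dropna()
          .sort_values()
          .diff()
          .dt.days
          .dropna())
```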
I didn't use Goodreads much in 2012, so let's see how it looks without 2012:

```python
# first, transform to datetype and get rid of all invalid dates
#dates = pd.to_datetime(cleaned_df["Date Read"])
dates = pd.to_datetime(cleaned_df["Date Added"])
dates = dates.dropna()
sorted_dates = sorted(dates)

last_date = None
differences = []
all_days = []
all_days_without_2012 = [] # not much goodreads usage in 2012 - remove that year

for date in sorted_dates:
    if not last_date:
        last_date = date
    if date.year != 2012:
        last_date_not_2012 = date
    difference = date - last_date
    days = difference.days
    all_days.append(days)
    if date.year != 2012:
        all_days_without_2012.append(days)
    last_date = date

sns.distplot(all_days_without_2012, axlabel="Distance in days between books read")
pylab.show()
```

![png](README_files/README_21_0.png)

***

## plot Heatmap of dates read

Parses the "dates read" for each book read, bins them by month, and makes a heatmap to show in which months I read more than in others. Also makes a lineplot of books read, split up by year.

NOTE: There has been a very strange bug in Goodreads for about a year now: the exported CSV does not correctly track the date read.

```python
# we need a dataframe in this format:
# year months books_read
# I am sure there's some magic pandas function for this

read_dict = defaultdict(int) # key: (year, month), value: count of books read
for date in sorted_dates:
    this_year = date.year
    this_month = date.month
    read_dict[(this_year, this_month)] += 1

first_date = sorted_dates[0]
first_year = first_date.year
first_month = first_date.month

todays_date = datetime.datetime.today()
todays_year = todays_date.year
todays_month = todays_date.month

all_years = []
all_months = []
all_counts = []
for year in range(first_year, todays_year + 1):
    for month in range(1, 13):
        if (year == todays_year) and month > todays_month:
            # don't count future months
            break
        this_count = read_dict[(year, month)]
        all_years.append(year)
        all_months.append(month)
        all_counts.append(this_count)

# now get it in the format heatmap() wants
df = pd.DataFrame({"month": all_months, "year": all_years, "books_read": all_counts})
dfp = df.pivot("month", "year", "books_read")

fig, ax = plt.subplots(figsize=(10, 10))
# now make the heatmap
ax = sns.heatmap(dfp, annot=True, ax=ax, square=True)
```

![png](README_files/README_23_0.png)

What happened in May 2014?

Update in 2018 - currently the 'date_read' column doesn't accurately track which books were actually read; this is a bug on Goodreads' end, see for example https://help.goodreads.com/s/question/0D51H00004ADr7o/i-have-exported-my-library-and-some-books-do-not-have-any-information-listed-for-date-read

***

## Plot books read by year

```python
g = sns.FacetGrid(df, col="year", sharey=True, sharex=True, col_wrap=4)
g.map(plt.scatter, "month", "books_read")
g.set_ylabels("Books read")
g.set_xlabels("Month")
pylab.xlim(1, 12)
pylab.show()
```

![png](README_files/README_25_0.png)

It's nice how reading behaviour (Goodreads usage) connects over the months - it slowly picks up in 2013, stays constant in 2014/2015, and now goes down again. You can see when my first son was born! (Solution: 2016-8-25) (All other >2018 books are still missing their date_read dates...)

***

## Guessing authors' genders

Let's check whether I read mostly male or female authors using the gender-guesser package!
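(For context: gender-guesser's `Detector.get_gender()` returns one of 'unknown', 'andy' (androgynous), 'male', 'female', 'mostly_male' or 'mostly_female'. A quick sketch of the API with a few illustrative first names, before the real cell below:)

```python
import gender_guesser.detector as gender

d = gender.Detector(case_sensitive=False)
# possible return values: 'unknown', 'andy' (androgynous), 'male',
# 'female', 'mostly_male', 'mostly_female'
for name in ['Ursula', 'Kim', 'Philipp']:  # illustrative names, not from the data
    print(name, d.get_gender(name))
```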
```python
first_names = cleaned_df['Author'].str.split(' ', expand=True)[0]
d = gender.Detector(case_sensitive=False)
genders = [d.get_gender(name) for name in first_names]
print(list(zip(genders[:5], first_names[:5])))
# let's also add those few 'mostly_female' and 'mostly_male' into the main group
genders = pd.Series([x.replace('mostly_female', 'female').replace('mostly_male', 'male') for x in genders])
```

    [('male', 'Don'), ('male', 'Daniil'), ('male', 'William'), ('unknown', 'E.T.A.'), ('male', 'John')]

```python
gender_ratios = genders.value_counts()
print(gender_ratios)
_ = gender_ratios.plot(kind='bar')
```

    male       423
    unknown     67
    female      56
    andy         3
    dtype: int64

![png](README_files/README_28_1.png)

Now THAT'S gender bias. Do I rate the genders differently?

```python
cleaned_df['Gender'] = genders
male_scores = cleaned_df[cleaned_df['Gender'] == 'male']['My Rating'].values
female_scores = cleaned_df[cleaned_df['Gender'] == 'female']['My Rating'].values
_ = plt.hist([male_scores, female_scores], color=['r', 'b'], alpha=0.5)
```

![png](README_files/README_30_0.png)

Hard to tell any difference since there are so many fewer women authors here - let's split them up into different plots.

```python
fig, axes = plt.subplots(2, 1)

axes[0].hist(male_scores, color='r', alpha=0.5, bins=10)
axes[0].set_xlabel('Scores')
# Make the y-axis label, ticks and tick labels match the line color.
axes[0].set_ylabel('male scores')

axes[1].hist(female_scores, color='b', alpha=0.5, bins=10)
axes[1].set_ylabel('female scores')

fig.tight_layout()
```

![png](README_files/README_32_0.png)

Are these two samples drawn from the same distribution? Hard to tell since their sizes are so different, but let's ask Kolmogorov-Smirnov (null hypothesis: they are from the same distribution).

```python
scipy.stats.ks_2samp(male_scores, female_scores)
```

    Ks_2sampResult(statistic=0.22018779342723005, pvalue=0.13257156821934568)

We cannot reject the null hypothesis, as the p-value is above the usual 0.05 cutoff (but again, there are so few female scores...).

***

## Compare with Goodreads 10k

A helpful soul has uploaded ratings and stats for the 10,000 books with the most ratings on Goodreads (https://github.com/zygmuntz/goodbooks-10k). Let's compare those with my ratings! (You may have to run `git submodule update` to get the 10k submodule.)

```python
other = pd.read_csv('./goodbooks-10k/books.csv')
print(other.columns)
other.head(3)
```

    Index(['book_id', 'goodreads_book_id', 'best_book_id', 'work_id',
           'books_count', 'isbn', 'isbn13', 'authors',
           'original_publication_year', 'original_title', 'title',
           'language_code', 'average_rating', 'ratings_count',
           'work_ratings_count', 'work_text_reviews_count', 'ratings_1',
           'ratings_2', 'ratings_3', 'ratings_4', 'ratings_5', 'image_url',
           'small_image_url'],
          dtype='object')
|   | book_id | goodreads_book_id | best_book_id | work_id | books_count | isbn | isbn13 | authors | original_publication_year | original_title | ... | ratings_count | work_ratings_count | work_text_reviews_count | ratings_1 | ratings_2 | ratings_3 | ratings_4 | ratings_5 | image_url | small_image_url |
|---|---------|-------------------|--------------|---------|-------------|------|--------|---------|---------------------------|----------------|-----|---------------|--------------------|-------------------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------------|
| 0 | 1 | 2767052 | 2767052 | 2792775 | 272 | 439023483 | 9.780439e+12 | Suzanne Collins | 2008.0 | The Hunger Games | ... | 4780653 | 4942365 | 155254 | 66715 | 127936 | 560092 | 1481305 | 2706317 | https://images.gr-assets.com/books/1447303603m... | https://images.gr-assets.com/books/1447303603s... |
| 1 | 2 | 3 | 3 | 4***0799 | 491 | 439554934 | 9.780440e+12 | J.K. Rowling, Mary GrandPre | 1997.0 | Harry Potter and the Philosopher's Stone | ... | 4602479 | 4800065 | 75867 | 75504 | 101676 | 455024 | 1156318 | 3011543 | https://images.gr-assets.com/books/1474154022m... | https://images.gr-assets.com/books/1474154022s... |
| 2 | 3 | 41865 | 41865 | 3212258 | 226 | 316015849 | 9.780316e+12 | Stephenie Meyer | 2005.0 | Twilight | ... | 3866839 | 3916824 | 95009 | 456191 | 436802 | 793319 | 875073 | 1355439 | https://images.gr-assets.com/books/1361039443m... | https://images.gr-assets.com/books/1361039443s... |

3 rows × 23 columns
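(One hedged way to line the two tables up for the comparison - not necessarily what the original notebook does later - is a join on lower-cased titles:)

```python
# Sketch only: join my rated books with the goodbooks-10k table on lower-cased title
merged = cleaned_df.assign(join_title=cleaned_df["Title"].str.lower()).merge(
    other.assign(join_title=other["title"].str.lower()),
    on="join_title", how="inner")
print("%s of my rated books are in the 10k set" % len(merged))
```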

What's the gender ratio here?

```python
other_first_names = other.authors.str.split(' ', expand=True)[0]
for index, x in enumerate(other_first_names):
    if x == 'J.R.R.':
        other_first_names[index] = 'John'
    elif x == 'J.K.':
        other_first_names[index] = 'Joanne'
    elif x == 'F.':
        ...
```
