hacker-news-gpt-2

Category: GPT/ChatGPT
Development tool: Others
File size: 3198KB
Downloads: 0
Upload date: 2019-05-20 03:57:11
Uploader: sh-1993
Description: Dump of generated texts from GPT-2 trained on Hacker News titles

File list:
LICENSE (1065, 2019-05-20)
good_0_7.txt (1357, 2019-05-20)
good_1_0.txt (1021, 2019-05-20)
pic.png (269582, 2019-05-20)
temp_0_7 (0, 2019-05-20)
temp_0_7\gpt2_gentext_20190426_144956.txt (68787, 2019-05-20)
temp_0_7\gpt2_gentext_20190426_145042.txt (70034, 2019-05-20)
temp_0_7\gpt2_gentext_20190426_145127.txt (67641, 2019-05-20)
temp_0_7\gpt2_gentext_20190426_145214.txt (69032, 2019-05-20)
temp_0_7\gpt2_gentext_20190426_145301.txt (69585, 2019-05-20)
temp_0_7\gpt2_gentext_20190426_145348.txt (68051, 2019-05-20)
temp_0_7\gpt2_gentext_20190426_145434.txt (69018, 2019-05-20)
temp_0_7\gpt2_gentext_20190426_145521.txt (68515, 2019-05-20)
temp_0_7\gpt2_gentext_20190426_145609.txt (68209, 2019-05-20)
temp_0_7\gpt2_gentext_20190426_145657.txt (70350, 2019-05-20)
temp_0_7\gpt2_gentext_20190426_152729.txt (68851, 2019-05-20)
temp_0_7\gpt2_gentext_20190426_152816.txt (68633, 2019-05-20)
temp_0_7\gpt2_gentext_20190426_152902.txt (68525, 2019-05-20)
temp_0_7\gpt2_gentext_20190426_152948.txt (68665, 2019-05-20)
temp_0_7\gpt2_gentext_20190426_153034.txt (68269, 2019-05-20)
temp_0_7\gpt2_gentext_20190426_153120.txt (68871, 2019-05-20)
temp_0_7\gpt2_gentext_20190426_153207.txt (67961, 2019-05-20)
temp_0_7\gpt2_gentext_20190426_153253.txt (68256, 2019-05-20)
temp_0_7\gpt2_gentext_20190426_153342.txt (68792, 2019-05-20)
temp_0_7\gpt2_gentext_20190426_153430.txt (69217, 2019-05-20)
temp_0_7_top_p (0, 2019-05-20)
temp_0_7_top_p\gpt2_gentext_20190520_024044.txt (68291, 2019-05-20)
temp_0_7_top_p\gpt2_gentext_20190520_024133.txt (67972, 2019-05-20)
temp_0_7_top_p\gpt2_gentext_20190520_024222.txt (69542, 2019-05-20)
temp_0_7_top_p\gpt2_gentext_20190520_024310.txt (68103, 2019-05-20)
temp_0_7_top_p\gpt2_gentext_20190520_024359.txt (68179, 2019-05-20)
temp_0_7_top_p\gpt2_gentext_20190520_024447.txt (68337, 2019-05-20)
temp_0_7_top_p\gpt2_gentext_20190520_024536.txt (68912, 2019-05-20)
temp_0_7_top_p\gpt2_gentext_20190520_024625.txt (68763, 2019-05-20)
temp_0_7_top_p\gpt2_gentext_20190520_024714.txt (69680, 2019-05-20)
temp_0_7_top_p\gpt2_gentext_20190520_024805.txt (68877, 2019-05-20)
temp_0_7_top_p\gpt2_gentext_20190520_024854.txt (69131, 2019-05-20)
... ...

# hacker-news-gpt-2

![](https://github.com/minimaxir/hacker-news-gpt-2/blob/master/pic.png)

Dump of generated texts from [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple) trained on Hacker News titles submitted up to April 25th, 2019 (about 603k titles, 30MB of text) for 36,813 steps (12 hours w/ a P100 GPU, costing ~$6). The output is *definitely* not similar to that of Markov chains.

For each temperature, there are 20 dumps of 1,000 titles (you can see some good curated titles in the `good_XXX.txt` files). The higher the temperature, the crazier the text.

* `temp_0_7`: Normal and syntactically correct, but the AI sometimes copies existing titles verbatim. I recommend checking against HN Search.
* `temp_1_0`: Crazier, mostly syntactically correct. Funnier IMO. Almost all titles are unique and have not been posted on HN before.
* `temp_1_3`: Even crazier, only occasionally syntactically correct.

The `top_p` variants are generated at the same temperatures using nucleus sampling with `top_p = 0.9`. The results are slightly crazier at each corresponding temperature, but not off-the-rails.
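To make the two sampling knobs concrete: temperature rescales the logits before the softmax (lower values sharpen the distribution toward safe, verbatim-ish titles), while nucleus (top-p) sampling truncates to the smallest set of tokens whose cumulative probability reaches `p` and renormalizes. A minimal NumPy sketch of the idea (the function names are mine, not gpt-2-simple's API):

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    z = np.exp(x - x.max())
    return z / z.sum()

def sample_probs(logits, temperature=0.7, top_p=None):
    """Turn raw logits into a sampling distribution.

    Lower temperature -> sharper (more conservative) distribution.
    If top_p is set, keep only the smallest set of tokens whose
    cumulative probability reaches top_p (nucleus sampling).
    """
    probs = softmax(np.asarray(logits, dtype=float) / temperature)
    if top_p is not None:
        order = np.argsort(probs)[::-1]            # most probable first
        cumulative = np.cumsum(probs[order])
        cutoff = np.searchsorted(cumulative, top_p) + 1
        mask = np.zeros_like(probs)
        mask[order[:cutoff]] = 1.0                 # keep the nucleus
        probs = probs * mask
        probs /= probs.sum()                       # renormalize
    return probs
```

This is why `top_p = 0.9` stays "not off-the-rails" even at higher temperatures: the long tail of unlikely tokens is zeroed out before sampling.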
## How To Get the Text and Train the Model

The Hacker News titles were retrieved from BigQuery (w/ [a trick](https://stackoverflow.com/questions/7394748/whats-the-right-way-to-decode-a-string-that-has-special-html-entities-in-it/2***24550#2***24550) to decode HTML entities that occasionally clutter BQ data):

```sql
CREATE TEMPORARY FUNCTION HTML_DECODE(enc STRING)
RETURNS STRING
LANGUAGE js AS """
var decodeHtmlEntity = function(str) {
  return str.replace(/&#(\\d+);/g, function(match, dec) {
    return String.fromCharCode(dec);
  });
};

try {
  return decodeHtmlEntity(enc);
} catch (e) {
  return null;
}
return null;
""";

SELECT HTML_DECODE(title)
FROM `bigquery-public-data.hacker_news.full`
WHERE type = 'story'
AND timestamp < '2019-04-25'
AND score >= 5
ORDER BY timestamp
```

The file was exported as a CSV, uploaded to a GCP VM w/ a P100 (120s / 100 steps), then converted to a gpt-2-simple-friendly TXT file via `gpt2.encode_csv()`. Training was initiated with the CLI command:

```sh
gpt_2_simple finetune csv_encoded.txt
```

and the files were generated with the CLI command:

```sh
gpt_2_simple generate --temperature XXX --nsamples 1000 --batch_size 25 --length 100 --prefix "<|startoftext|>" --truncate "<|endoftext|>" --include_prefix False --nfiles 10
```

The generated files were then downloaded locally.

## Maintainer/Creator

Max Woolf ([@minimaxir](https://minimaxir.com))

## License

MIT
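For prototyping outside BigQuery, the same entity decoding can be sketched in plain Python. This is my local stand-in for the JS UDF above, not part of the repo; like the UDF it only handles numeric entities such as `&#39;` (Python's `html.unescape` would additionally cover named entities like `&amp;`):

```python
import re

def html_decode(title):
    """Decode numeric HTML entities (e.g. &#39; -> ') in a title,
    mirroring the BigQuery JS UDF; returns None on null input."""
    if title is None:
        return None
    return re.sub(r"&#(\d+);", lambda m: chr(int(m.group(1))), title)

print(html_decode("Ask HN: What&#39;s new?"))  # Ask HN: What's new?
```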
