pythonCode-Examples.zip

  • 霍元甲
    Author
  • Python
    Development tool
  • 1MB
    File size
  • zip
    File format
  • 0
    Favorites
  • 1 point
    Download cost
  • 3
    Downloads
  • 2016-11-22 21:40
    Upload date
A very good collection of Python files, well worth downloading and very useful.
  • pythonCode Examples
  • graphics
  • histogram.py
    1.8KB
  • graphics.py
    33.3KB
  • gmane
  • gline.py
    1.7KB
  • gword.htm
    1.1KB
  • gword.jsonp
    2.8KB
  • gyear.py
    1.9KB
  • gline.htm
    763B
  • gmodel.py
    7.4KB
  • d3.layout.cloud.js
    10.6KB
  • gmane.py
    4KB
  • d3.v2.js
    247.8KB
  • README.txt
    8.2KB
  • gline.jsonp
    3.7KB
  • gword.py
    1.5KB
  • gbasic.py
    1.5KB
  • geodata
  • where.data
    6.4KB
  • where.html
    1.7KB
  • geodump.py
    1000B
  • geoload.py
    1.4KB
  • where.js
    18.5KB
  • README.txt
    4.2KB
  • old-twitter
  • twdump.py
    203B
  • twitter4.py
    663B
  • twspider.py
    1.5KB
  • twitter1.py
    328B
  • twitter2.py
    723B
  • twitter3.py
    723B
  • twfriends.py
    2.3KB
  • twjoin.py
    625B
  • pagerank
  • force.css
    108B
  • LICENSE
    1.4KB
  • spider.json
    8.4KB
  • spjson.py
    1.8KB
  • spdump.py
    387B
  • d3.v2.js
    247.8KB
  • README.txt
    4.9KB
  • spreset.py
    207B
  • spider.py
    4KB
  • force.js
    1.7KB
  • force.html
    511B
  • BeautifulSoup.py
    77.5KB
  • sprank.py
    2.9KB
  • words.txt
    1.1KB
  • wikidata.db
    16KB
  • re12.py
    223B
  • gmane.zip
    77.9KB
  • grade.py
    286B
  • geodata.zip
    16.4KB
  • geojson.py
    833B
  • count3.py
    485B
  • socket1.py
    296B
  • xml2.py
    515B
  • pals.py
    113B
  • celsius.py
    108B
  • soft.py
    224B
  • words.py
    362B
  • avenum.py
    233B
  • twdump.py
    207B
  • pagerank.zip
    91.8KB
  • socket2.py
    323B
  • count1.py
    335B
  • mailcount.py
    227B
  • search3.py
    215B
  • curl2.py
    277B
  • txtcheck.py
    562B
  • egg.py
    377B
  • romeo.txt
    167B
  • sequence.py
    223B
  • twspider.py
    1.7KB
  • wikigrade.py
    2.3KB
  • db2.py
    436B
  • twitter1.py
    518B
  • urllinks.py
    232B
  • search1.py
    113B
  • re08.py
    221B
  • json2.py
    319B
  • open.py
    104B
  • copytildone.py
    148B
  • largest.py
    213B
  • twitter2.py
    657B
  • mbox.txt
    6.4MB
  • argfile.py
    114B
  • txtcheck2.py
    665B
  • re06.py
    216B
  • search6.py
    209B
  • re05.py
    143B
  • urllink2.py
    357B
  • re13.py
    224B
  • pay3.py
    300B
  • mailtop.py
    379B
  • cleanup.sh
    133B
  • grep.py
    305B
  • txtmd5.py
    529B
  • spamave.py
    465B
Description
Analyzing an Email Archive from gmane and Visualizing the Data Using the D3 JavaScript Library

This is a set of tools that let you pull down an archive of a gmane repository using the instructions at:

http://gmane.org/export.php

To view and modify the databases, install the SQLite Manager add-on for Firefox from:

https://addons.mozilla.org/en-us/firefox/addon/sqlite-manager/

The first step is to spider the gmane repository. The base URL is hard-coded in gmane.py and points to the Sakai developer list. You can spider another repository by changing that base URL. Make sure to delete the content.sqlite file if you switch the base URL.

The gmane.py file operates as a spider: it runs slowly, retrieving one mail message per second to avoid being throttled by gmane.org. It stores all of its data in a database and can be interrupted and restarted as often as needed. Pulling all the data down may take many hours, so you may need to restart several times.

To give you a head start, I have put up 600MB of pre-spidered Sakai email here:

https://online.dr-chuck.com/files/sakai/email/content.sqlite

If you download this, you can "catch up with the latest" by running gmane.py.
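The spidering behavior described above (one message per second, everything stored in a database, safe to interrupt and restart) can be sketched roughly as follows. This is a minimal Python 3 illustration, not the code in gmane.py: the `Messages` table layout, the `spider` function, and the caller-supplied `fetch` callable are all assumptions made for the example.

```python
import sqlite3
import time


def spider(conn, fetch, start, count, delay=1.0):
    """Fetch up to `count` messages starting at id `start`, pausing `delay` seconds each.

    `fetch(msg_id)` is a hypothetical caller-supplied function that returns the
    raw message text, or None when nothing could be retrieved.
    """
    # Hypothetical table layout; the real content.sqlite schema may differ.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Messages (id INTEGER PRIMARY KEY, body TEXT)"
    )
    fetched = 0
    msg_id = start
    while fetched < count:
        text = fetch(msg_id)
        # Stop at anything that does not look like a mail message.
        if text is None or not text.startswith("From "):
            break
        conn.execute(
            "INSERT OR IGNORE INTO Messages (id, body) VALUES (?, ?)",
            (msg_id, text),
        )
        conn.commit()  # commit per message so an interrupt loses no work
        fetched += 1
        msg_id += 1
        time.sleep(delay)  # throttle requests to stay polite to the server
    return fetched
```

Committing after every message is what makes the loop safe to interrupt: a restart only has to find the first id that is not yet in the table.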
Navigate to the folder where you extracted gmane.zip.

Here is a run of gmane.py getting the last five messages of the Sakai developer list:

    Mac: python gmane.py
    Win: gmane.py

    How many messages:10
    http://download.gmane.org/gmane.comp.cms.sakai.devel/51410/51411 9460
        nealcaidin@sakaifoundation.org 2013-04-05T12:37:27-04:00 re: [building sakai] testing common cartridge
    http://download.gmane.org/gmane.comp.cms.sakai.devel/51411/51412 3379
        samuelgutierrezjimenez@gmail.com 2013-04-06T03:30:11-07:00 re: [building sakai] melete 2.9 oracle issue (w/ possible fix)
    http://download.gmane.org/gmane.comp.cms.sakai.devel/51412/51413 9903
        da1@vt.edu 2013-04-05T15:58:51-04:00 [building sakai] melete 2.9 oracle issue (w/ possible fix)
    http://download.gmane.org/gmane.comp.cms.sakai.devel/51413/51414 349265
        m.shedid@elraed-it.com 2013-04-07T11:19:40+03:00 [building sakai] setup development enviroment
    http://download.gmane.org/gmane.comp.cms.sakai.devel/51414/51415 3481
        samuelgutierrezjimenez@gmail.com 2013-04-07T02:31:16-07:00 re: [building sakai] setup development enviroment
    http://download.gmane.org/gmane.comp.cms.sakai.devel/51415/51416 0
    Does not start with From

The program scans content.sqlite from 1 up to the first message number not already spidered and starts spidering at that message. It continues until it has spidered the desired number of messages or reaches a page that does not appear to be a properly formatted message.

Sometimes gmane.org is missing a message. Perhaps administrators can delete messages or perhaps they get lost - I don't know. If your spider stops and it seems to have hit a missing message, go into the SQLite Manager and add a row with the missing id, leaving all the other fields blank, and then restart gmane.py. This will unstick the spidering process and allow it to continue. These empty messages will be ignored in the next phase of the process.
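The restart logic and the manual fix for a missing message described above can be sketched like this. It is a Python 3 illustration under assumptions: the `Messages` table with `id` and `body` columns and the helper names `open_db`, `next_unspidered`, and `add_placeholder` are hypothetical and not taken from gmane.py.

```python
import sqlite3


def open_db(path):
    """Open the database and make sure the (hypothetical) Messages table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Messages (id INTEGER PRIMARY KEY, body TEXT)"
    )
    return conn


def next_unspidered(conn):
    """Return the first id, counting up from 1, that has no row yet."""
    msg_id = 1
    while conn.execute(
        "SELECT 1 FROM Messages WHERE id = ?", (msg_id,)
    ).fetchone():
        msg_id += 1
    return msg_id


def add_placeholder(conn, msg_id):
    """Insert a row with only the id set so the scan can move past a missing message."""
    conn.execute("INSERT OR IGNORE INTO Messages (id) VALUES (?)", (msg_id,))
    conn.commit()
```

Calling `add_placeholder(conn, n)` does in code what the text does by hand in the SQLite Manager: the row exists, so the scan moves on, and its empty body lets a later cleaning step skip it.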
One nice thing is that once you have spidered all of the messages and have them in content.sqlite, you can run gmane.py again to get new messages as they are sent to the list. gmane.py will quickly scan to the end of the already-spidered pages, check whether there are new messages, and then retrieve them and add them to content.sqlite.

The content.sqlite data is pretty raw, with an inefficient data model, and is not compressed. This is intentional, as it allows you to look at content.sqlite to debug the process. It would be a bad idea to run any queries against this database, as they would be slow.

The second process is running the program gmodel.py. gmodel.py reads the raw data from content.sqlite and produces a cleaned-up and well-modeled version of the data in the file index.sqlite. The file index.sqlite will be much smaller (often 10X smaller) than content.sqlite because it also compresses the header and body text.

Each time gmodel.py runs, it completely wipes out and rebuilds index.sqlite, allowing you to adjust its parameters and edit the mapping tables in content.sqlite to tweak the data cleaning process.

Running gmodel.py works as follows:

    Mac: python gmodel.py
    Win: gmodel.py

    Loaded allsenders 1588 and mapping 28 dns mapping 1
    1 2005-12-08T23:34:30-06:00 ggolden22@mac.com
    251 2005-12-22T10:03:20-08:00 tpamsler@ucdavis.edu
    501 2006-01-12T11:17:34-05:00 lance@indiana.edu
    751 2006-01-24T11:13:28-08:00 vrajgopalan@ucmerced.edu
    ...

The gmodel.py program does a number of data cleaning steps.

Domain names are truncated to two levels for .com, .org, .edu, and .net; other domain names are truncated to three levels. So si.umich.edu becomes umich.edu and caret.cam.ac.uk becomes cam.ac.uk. Mail addresses are also forced to lower case, and some of the @gmane.org addresses like the following

    arwhyte-63aXycvo3TyHXe+LvDLADg@public.gmane.org

are converted to the real address whenever there is a matching real email address elsewhere in the message corpus.
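The domain truncation and lower-casing rules described above can be expressed in a few lines. This is a Python 3 sketch written from the description only; the function names `clean_domain` and `clean_sender` are hypothetical, and rewriting the address with the truncated domain is an assumption of this example.

```python
TWO_LEVEL_SUFFIXES = ("com", "org", "edu", "net")


def clean_domain(domain):
    """Truncate to two levels for .com/.org/.edu/.net, otherwise to three."""
    pieces = domain.lower().split(".")
    keep = 2 if pieces[-1] in TWO_LEVEL_SUFFIXES else 3
    return ".".join(pieces[-keep:])


def clean_sender(address):
    """Lower-case an address and truncate its domain (assumed behavior)."""
    address = address.strip().lower()
    user, at, domain = address.rpartition("@")
    if not at:
        return address  # not a user@domain address; leave it alone
    return user + "@" + clean_domain(domain)
```

With these rules `clean_domain("si.umich.edu")` gives `umich.edu` and `clean_domain("caret.cam.ac.uk")` gives `cam.ac.uk`, matching the two examples in the text.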
If you look in the content.sqlite database, there are two tables that allow you to map both domain names and individual email addresses that change over the lifetime of the email list. For example, Steve Githens used the following email addresses over the life of the Sakai developer list:

    s-githens@northwestern.edu
    sgithens@cam.ac.uk
    swgithen@mtu.edu

We can add two entries to the Mapping table:

    s-githens@northwestern.edu -> swgithen@mtu.edu
    sgithens@cam.ac.uk -> swgithen@mtu.edu

All of his mail messages will then be collected under one sender, even though he used several email addresses over the lifetime of the mailing list.

You can also make similar entries in the DNSMapping table if there are multiple DNS names you want mapped to a single DNS name. In the Sakai data I add the following mapping:

    iupui.edu -> indiana.edu

So all the folks from the various Indiana University campuses are tracked together.

You can re-run gmodel.py over and over as you look at the data, adding mappings to make the data cleaner and cleaner. When you are done, you will have a nicely indexed version of the email in index.sqlite. This is the file to use for data analysis. With this file, data analysis will be really quick.

The first, simplest data analysis is to ask "who does the most?" and "which organization does the most?" This is done using gbasic.py:

    Mac: python gbasic.py
    Win: gbasic.py

    How many to dump? 5
    Loaded messages= 51330 subjects= 25033 senders= 1584

    Top 5 Email list participants
    steve.swinsburg@gmail.com 2657
    azeckoski@unicon.net 1742
    ieb@tfd.co.uk 1591
    csev@umich.edu 1304
    david.horwitz@uct.ac.za 1184

    Top 5 Email list organizations
    gmail.com 7339
    umich.edu 6243
    uct.ac.za 2451
    indiana.edu 2258
    unicon.net 2055

You can look at the data in index.sqlite and, if you find a problem, update the Mapping table and DNSMapping table in content.sqlite and re-run gmodel.py.
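To make the effect of the Mapping and DNSMapping tables on the "who does the most" counts concrete, here is a small Python 3 sketch. It is not the code in gmodel.py or gbasic.py: the mappings are plain dictionaries here rather than database tables, the function names are hypothetical, and `user@iupui.edu` in the usage note is a made-up address.

```python
from collections import Counter


def resolve(address, mapping):
    """Follow address aliases until none applies (guarding against cycles)."""
    seen = set()
    while address in mapping and address not in seen:
        seen.add(address)
        address = mapping[address]
    return address


def top_counts(senders, mapping, dns_mapping, howmany):
    """Count messages per person and per organization after applying mappings."""
    people = Counter()
    orgs = Counter()
    for raw in senders:
        address = resolve(raw.lower(), mapping)
        people[address] += 1
        domain = address.rpartition("@")[2]
        orgs[dns_mapping.get(domain, domain)] += 1
    return people.most_common(howmany), orgs.most_common(howmany)
```

Given the two Mapping entries from the text and the DNSMapping entry `iupui.edu -> indiana.edu`, one message from each of the three Githens addresses is counted as three messages from `swgithen@mtu.edu`, and a message from `user@iupui.edu` is counted under `indiana.edu`.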
There is a simple visualization of the word frequency in the subject lines in the file gword.py:

    Mac: python gword.py
    Win: gword.py

    Range of counts: 33229 129
    Output written to gword.js

This produces the file gword.js, which you can visualize using the file gword.htm.

A second visualization is in gline.py. It visualizes email participation by organizations over time.

    Mac: python gline.py
    Win: gline.py

    Loaded messages= 51330 subjects= 25033 senders= 1584
    Top 10 Oranizations
    ['gmail.com', 'umich.edu', 'uct.ac.za', 'indiana.edu', 'unicon.net', 'tfd.co.uk', 'berkeley.edu', 'longsight.com', 'stanford.edu', 'ox.ac.uk']
    Output written to gline.js

Its output is written to gline.js, which is visualized using gline.htm.

Some URLs for visualization ideas:

https://developers.google.com/chart/

https://developers.google.com/chart/interactive/docs/gallery/motioncha
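The subject-line word frequency step described for gword.py can be approximated as below. This is a Python 3 sketch under assumptions, not the program itself: the minimum word length of four, the font-size range, the `gword` JavaScript variable name, and both function names are choices made for this example.

```python
import json
import string
from collections import Counter


def subject_word_sizes(subjects, howmany=100, smallest=20, biggest=80):
    """Count words in subject lines and scale the top counts to font sizes."""
    strip = str.maketrans("", "", string.punctuation + string.digits)
    counts = Counter()
    for subject in subjects:
        for word in subject.translate(strip).lower().split():
            if len(word) >= 4:  # assumed cutoff to skip short words like "re"
                counts[word] += 1
    top = counts.most_common(howmany)
    if not top:
        return []
    highest, lowest = top[0][1], top[-1][1]
    span = max(highest - lowest, 1)  # avoid dividing by zero when all counts match
    return [
        {"text": word,
         "size": smallest + (count - lowest) * (biggest - smallest) // span}
        for word, count in top
    ]


def write_js(words, path, variable="gword"):
    """Write the word list as a JavaScript assignment that an HTML page can load."""
    with open(path, "w") as handle:
        handle.write(variable + " = " + json.dumps(words) + ";\n")
```

Scaling between the lowest and highest counts in the selected words keeps the most frequent word at the largest font size and the least frequent at the smallest, whatever the raw range of counts is.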