Pycon2015: Explore Big Data using Simple Python Code and Cloud Environment

Pycon2015-master.zip
  • Pycon2015-master
    • Explore Big Data using Simple Python Code and Cloud Environment_V8.pptx (938KB)
    • scripts to scrap data from wiki.txt (1016B)
    • Map Reduce Code -Ipython Notebook.html (199.5KB)
    • Extract data
      • extractgz.py (175B)
      • downloadwikifiles.sh (287B)
    • Python and IPython Installation.docx (110.1KB)
    • process output
      • processout.sh (386B)
    • README.md (8.8KB)
    • mapreduce scripts
      • reducer.py (484B)
      • mapper.py (276B)
Content Description
# Explore Big Data using Simple Python Code and Cloud Environment

Presentation deck and other supporting material for "Explore Big Data using Simple Python Code and Cloud Environment". You can learn Hadoop MapReduce on real big data with little effort and cost. Below is the step-by-step procedure to set up the environment for running a Hadoop cluster in Amazon AWS, along with the supporting scripts used for extracting the data from Wikipedia and automating other activities.

1. Install Python and IPython on your local Windows computer. Follow the steps in the document named "Python and IPython Installation".

2. Prerequisites in Amazon AWS. Go through http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-gs-prerequisites.html for:
   i. Create an Amazon AWS account.
   ii. Create an Amazon S3 bucket (storage) for storing the input, output, map/reduce scripts, etc.
   iii. Create an Amazon EC2 key pair to connect to the nodes (virtual servers) in Amazon EC2 and EMR through Secure Shell (SSH).
   iv. Create an IAM profile: this is required for accessing S3 storage from EC2. Follow the steps at http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-set-up.html#cli-signup.
   v. Grant AmazonS3FullAccess privileges to the IAM user created above by going to IAM -> Users, clicking on the user, and attaching the policy called AmazonS3FullAccess. Now, using the IAM credentials, we can access S3 from EC2.
   vi. To install PuTTY and PuTTYgen, and to convert the private key pair to PuTTY format, follow the steps at http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/putty.html.

3. Launch an Amazon EC2 instance (Linux instance) to download the wiki files, unzip them, and upload them to S3 (Amazon storage).
   i. Launch a simple micro instance preloaded with Python: log in to aws.amazon.com and select EC2 under Services. Click on Launch Instance and select the Amazon Linux AMI in the "Choose an Amazon Machine Image" step. Select t2.micro in step 2, "Choose an Instance Type". Click on "Review and Launch". Before clicking on Launch in step 7, ensure that under security groups port 22 is enabled, which is required to connect to this instance from PuTTY over SSH. After clicking on Launch, select the key pair you created in the steps above and click on Launch Instances. Now click on View Instances. Once the instance status changes to "Running", select the instance and copy the public IP address from the description panel at the bottom. Launch a PuTTY session with the hostname ec2-user@IP-Address, and you will be able to log in to the EC2 instance you created.
   ii. Configure AWS S3 on the Amazon EC2 instance. This lets you copy files between S3 and EC2. Type "aws configure" and, at the prompt, enter the credentials you copied when creating the IAM profile. Leave the region and output format blank.
   iii. Download the index file containing links for each hour in a month. Say you want to extract the September log files; type at the shell prompt: wget https://dumps.wikimedia.org/other/pagecounts-raw/2015/2015-09/ Now you have the index.html file copied to the directory.
   iv. This index file has all the links for the September log files. To extract the log file names from index.html, copy the Python program extractgz.py from the "Extract data" folder to EC2 and run "python extractgz.py > allfilenames". A file called "allfilenames" is created in the same directory with all the file names. (A minimal sketch of this extraction step is shown below.)
   v. Update the bucket name you created in place of <your buckethere> in the sh file.
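The actual extractgz.py ships in the "Extract data" folder of the archive; the snippet below is only a minimal sketch of the idea, assuming the index.html fetched by the wget command above sits in the current directory and that the hourly dumps are linked as pagecounts-*.gz (the repo's script may parse the page differently).

```python
# extract_gz_names.py -- illustrative sketch, not the repo's extractgz.py.
# Reads the index.html downloaded from dumps.wikimedia.org and prints the
# names of the hourly pagecounts-*.gz files, one per line.
import re

with open("index.html") as f:
    html = f.read()

# Hourly dump files are linked as href="pagecounts-YYYYMMDD-HHMMSS.gz"
names = re.findall(r'href="(pagecounts-[\d-]+\.gz)"', html)

for name in names:
    print(name)
```

Used the same way as in step 3.iv: `python extract_gz_names.py > allfilenames`.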
Execute the shell script by typing "bash downloadwikifiles.sh allfilenames". This downloads each file sequentially, unzips it, uploads it to S3, and removes the local copy. Amazon typically downloads these files at about 2 MB per second, so each file takes roughly 50 to 60 seconds and the full month takes approximately 12 hours. Run the shell script in the background so that execution continues even after you disconnect the PuTTY session. To check whether all the files have been downloaded, check the S3 bucket periodically; once everything is there, terminate the EC2 instance by going to the EC2 service and terminating it.

4. In the S3 bucket you created, create folders called input, logfiles, output, and scripts.

5. Upload the mapper and reducer scripts from the "mapreduce scripts" folder in GitHub to the scripts folder in your S3 bucket. (A sketch of what such a mapper/reducer pair looks like is shown after step 6 below.)

6. Launch EMR (a Hadoop cluster) and run the map/reduce job using Hadoop streaming. Before launching the EMR cluster, check how much it will cost per hour at https://aws.amazon.com/elasticmapreduce/pricing/.
   i. After logging in to your AWS account, select EMR under Services -> Analytics.
   ii. Click on Create Cluster and select "Go to Advanced Options".
   iii. Give the cluster any name.
   iv. For the log folder S3 location, select logfiles from the S3 bucket you created and append a unique log folder name, which will be created by EMR. Your log file location should look like "s3://<yourbucket>/logfiles/<unique folder>".
   v. In the Hardware Configuration section, go with m3.medium for the master and select an instance type for the core nodes. If you want to experiment with processing 2-3 days of files instead of a full month, start with m3.xlarge with a count of 2 or 3.
   vi. Under the Security and Access section, select the key pair you created initially, so that you can log in to the master node after the cluster is created.
   vii. Under Steps, select the step type "Streaming Program" and click on Configure and Add.
   viii. Give the step any name, and provide the mapper and reducer scripts in the mapper and reducer fields.
   ix. For the input S3 location, provide the folder of input files. If you want to start with 2-3 days of files, create another folder in your bucket, copy those files to it, and assign that folder in this field.
   x. The output S3 location will be your output folder with a unique name appended. If you provide an existing folder, the step will fail, so the location should look like s3://<your bucket>/output/<new folder name>.
   xi. Now click "Add". This step/job will be executed after the cluster is created.
   xii. You are almost ready to create the cluster by clicking on "Create Cluster". Wait, there is one more step remaining.
   xiii. For monitoring the status of the jobs, counters, configuration, etc., Hadoop pushes data to a web server on the master node. To access it from your local computer, you need to create an SSH proxy tunnel with PuTTY.
   xiv. Then proceed by clicking on "Create Cluster".
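The actual mapper.py and reducer.py live in the "mapreduce scripts" folder of the archive; the pair below is only an illustrative sketch of Hadoop streaming over the Wikipedia pagecounts format (lines of "project page_title view_count bytes"), and the fields the repo's scripts key on, as well as the "en" project filter, are assumptions for the example.

```python
#!/usr/bin/env python
# mapper_sketch.py -- illustrative Hadoop streaming mapper, not the repo's mapper.py.
# Each pagecounts line looks like: project page_title view_count bytes_transferred
# Emit "page_title<TAB>view_count" so the reducer can sum views per page.
import sys

for line in sys.stdin:
    fields = line.strip().split()
    if len(fields) != 4:
        continue                      # skip malformed lines
    project, title, count, _size = fields
    if project == "en":               # illustrative choice: English Wikipedia only
        print("%s\t%s" % (title, count))
```

```python
#!/usr/bin/env python
# reducer_sketch.py -- illustrative Hadoop streaming reducer, not the repo's reducer.py.
# Streaming sorts mapper output by key, so all counts for one page arrive
# consecutively; sum them and emit one total per page.
import sys

current_title = None
current_total = 0

for line in sys.stdin:
    title, _, count = line.strip().partition("\t")
    if not count.isdigit():
        continue                      # skip malformed records
    if title == current_title:
        current_total += int(count)
    else:
        if current_title is not None:
            print("%s\t%d" % (current_title, current_total))
        current_title, current_total = title, int(count)

if current_title is not None:
    print("%s\t%d" % (current_title, current_total))
```

Such a pair can be tested locally on one hourly file before launching EMR, for example: `zcat pagecounts-20150901-000000.gz | python mapper_sketch.py | sort | python reducer_sketch.py`.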
7. There are several things you can check while the job is running.
   i. The cluster is not created immediately after you click on Create Cluster. Amazon takes approximately 5-8 minutes to provision the virtual servers, install the required software, and configure the master and core/slave nodes.
   ii. As soon as the status of the master in the summary changes from Provisioning to Bootstrapping, the master public DNS/IP address appears. You can open a proxy tunnel to it using PuTTY as mentioned in the steps above. Then in Firefox or Chrome you can type "http://<Master IP Address>:8088", and you are ready to monitor and review the cluster configuration, counters, jobs, etc. Sometimes when you click on a link you may get a "cannot access" message; in that case, replace the internal IP address in the URL bar with the master public IP address.
   iii. Once the status of both the master and the core nodes changes to "Running", the initial Hadoop debugging step under Steps will complete, and then the job/step you created will execute.
   iv. You can monitor the step status by clicking on View Jobs against the step and then on View Tasks. You can see how many tasks are total, pending, completed, and running.
   v. Once the job is complete, the output will be stored in the output S3 location you specified when adding the step. (A sketch for pulling and inspecting that output follows below.)
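The archive also ships a processout.sh helper in the "process output" folder for post-processing the results. As a rough Python equivalent, the sketch below assumes boto3 is installed, the IAM credentials from step 2 are configured, and the hypothetical names my-pycon-bucket and output/run1/ stand in for whatever bucket and output folder you chose; it merges the part-* files the EMR streaming step wrote and prints the ten most viewed pages.

```python
# fetch_output_sketch.py -- illustrative sketch; bucket and prefix names are placeholders.
# Downloads the part-* files that the EMR streaming step wrote to S3,
# merges them, and prints the ten pages with the highest summed view counts.
import boto3

BUCKET = "my-pycon-bucket"      # hypothetical bucket name: replace with yours
PREFIX = "output/run1/"         # hypothetical output folder from step 6.x

s3 = boto3.client("s3")
totals = {}

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        if "part-" not in obj["Key"]:
            continue                  # skip _SUCCESS markers and folder keys
        body = s3.get_object(Bucket=BUCKET, Key=obj["Key"])["Body"].read().decode("utf-8")
        for line in body.splitlines():
            title, _, count = line.partition("\t")
            if count.strip().isdigit():
                totals[title] = totals.get(title, 0) + int(count)

# Print the ten most viewed pages across the whole output.
for title, count in sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]:
    print("%s\t%d" % (title, count))
```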