gcp-serverless-mapreduce

Category: Big Data
Development tool: Go
File size: 0KB
Downloads: 0
Upload date: 2023-09-06 16:17:58
Uploader: sh-1993
Description: GCP serverless MapReduce (gcp-serverless-mapreduce)

File list:
.env.example (73, 2023-09-06)
.gcloudignore (525, 2023-09-06)
Makefile (1875, 2023-09-06)
architecture.png (120265, 2023-09-06)
controller/ (0, 2023-09-06)
controller/controller.go (3765, 2023-09-06)
controller/controller_test.go (4810, 2023-09-06)
controller/delete-controller.sh (706, 2023-09-06)
controller/deploy-controller.sh (1601, 2023-09-06)
docker-compose.yaml (533, 2023-09-06)
functions.go (888, 2023-09-06)
go.mod (431, 2023-09-06)
go.sum (103484, 2023-09-06)
mapphase/ (0, 2023-09-06)
mapphase/combine.go (1805, 2023-09-06)
mapphase/combine_test.go (3956, 2023-09-06)
mapphase/delete-combiner.sh (675, 2023-09-06)
mapphase/delete-mapper.sh (659, 2023-09-06)
mapphase/delete-splitter.sh (677, 2023-09-06)
mapphase/delete-starter.sh (407, 2023-09-06)
mapphase/deploy-combiner.sh (1283, 2023-09-06)
mapphase/deploy-mapper.sh (1259, 2023-09-06)
mapphase/deploy-splitter.sh (1274, 2023-09-06)
mapphase/deploy-starter.sh (602, 2023-09-06)
mapphase/map.go (7278, 2023-09-06)
mapphase/map_test.go (4689, 2023-09-06)
mapphase/split.go (7619, 2023-09-06)
mapphase/split_test.go (7729, 2023-09-06)
mapphase/start.go (3617, 2023-09-06)
mapphase/start_test.go (6358, 2023-09-06)
pubsub/ (0, 2023-09-06)
pubsub/client.go (2755, 2023-09-06)
pubsub/client_test.go (3282, 2023-09-06)
pubsub/types.go (2155, 2023-09-06)
redis/ (0, 2023-09-06)
redis/initClient.go (1751, 2023-09-06)
reducephase/ (0, 2023-09-06)
... ...

# Serverless MapReduce to find Anagrams in a dataset

### Contents

- [Introduction](#introduction)
- [Prerequisites](#prerequisites)
- [Deployment](#deployment)
- [Running the Anagram MapReduce](#running-the-anagram-mapreduce)
  - [Option 1](#option-1)
  - [Option 2](#option-2)
  - [Results](#results)
- [Tests](#tests)
- [22COC105 Output](#22coc105-output)

### Introduction

This project is a serverless MapReduce implementation that finds anagrams in a set of Project Gutenberg books supplied as text files. The project is written in Go and uses GCP Cloud Functions, Cloud Pub/Sub, Cloud Storage and Cloud Memorystore for Redis. I chose Go because it is fast and makes concurrent code easy to write, which lets the MapReduce process run at high speed. The project was written for the 22COC105 Cloud Computing coursework.

Following the MapReduce programming model, the project is split into two main parts. The first is the map phase, in which the data is split, preprocessed, mapped to key-value pairs and combined at a per-book level (a mini-reduce). The second is the reduce phase, in which the key-value pairs are shuffled into Redis instances based on a hash of the key, and then reduced together to find the anagrams. The whole process takes under 20 seconds to run on 100 books, which is equivalent to ~43MB of data.
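The mapper itself is written in Go, but the core idea is easy to illustrate in the shell: each word is keyed by its letters in sorted order, so all words that are anagrams of one another share the same key, and in the shuffle that key is hashed to pick a Redis instance. A minimal sketch of the key derivation (illustrative only, not the project's actual mapper code):

```bash
# Illustrative sketch: derive the anagram key for a word by sorting its letters.
# Anagrams of one another ("ate", "eat", "tea") all produce the same key, "aet".
word="tea"
key=$(echo "$word" | grep -o . | sort | tr -d '\n')
echo "$key: $word"   # prints "aet: tea"
```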
### Prerequisites

- To deploy any part of the project, you need the gcloud CLI installed and configured. You can find the instructions [here](https://cloud.google.com/sdk/docs/quickstarts).
- You need a GCP project with billing enabled. You can find the instructions [here](https://cloud.google.com/billing/docs/how-to/modify-project).
- To run the tests, you need Go version 1.16 installed. You can find the instructions [here](https://golang.org/doc/install). You will also need Docker installed. You can find the instructions [here](https://docs.docker.com/get-docker/).
- Run `chmod +x ./*/*.sh` in the project root to make the scripts executable.

### Deployment

The first step is to create a `.env` file and set the `GCP_PROJECT` variable to the name of the GCP project you wish to deploy everything to, the `GCP_REGION` variable to the region you wish to deploy to (you can find the list of available regions [here](https://cloud.google.com/compute/docs/regions-zones)), and the `NO_OF_REDUCERS` variable to the number of reducer jobs you want to run (I used 5). An example file `.env.example` is provided in the root of the project. You can copy it to `.env` and modify it to your needs using `cp .env.example .env`.
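For reference, the resulting `.env` might look like the following; the values shown are placeholders, so substitute your own project name, region and reducer count:

```bash
# .env - placeholder values, replace with your own
GCP_PROJECT=my-gcp-project
GCP_REGION=europe-west2
NO_OF_REDUCERS=5
```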
You should then create two buckets in GCP Cloud Storage, one for the input data and one for the output data. You can do this either from the GCP console or with the following commands (replace `$GCP_PROJECT` with the name of your GCP project and `$GCP_REGION` with the region you wish to store your data in):

```bash
gsutil mb -p $GCP_PROJECT -l $GCP_REGION gs://$GCP_PROJECT-input
gsutil mb -p $GCP_PROJECT -l $GCP_REGION gs://$GCP_PROJECT-output
```

These bucket names should then be used when starting the MapReduce process. See [Running the Anagram MapReduce](#running-the-anagram-mapreduce) for more details.

Deploying the functions and Redis instances is straightforward thanks to the Bash scripts provided in each directory. Make commands are provided to run these scripts; you can find the Makefile in the root directory of the project. The commands are listed below.

**Note:** The commands are run from the root directory of the project.

**Note:** When you receive the message `Allow unauthenticated invocations of new function [starter]? (y/N)`, choose `y`.

```bash
# Deploy all the functions and Redis instances
make deploy

# OR individually

# Deploy the Redis instances
make deploy-redis
# Deploy the controller function
make deploy-controller
# Deploy the starter function
make deploy-starter
# Deploy the splitter function
make deploy-splitter
# Deploy the mapper function
make deploy-mapper
# Deploy the combiner function
make deploy-combiner
# Deploy the shuffler function
make deploy-shuffler
# Deploy the reducer function
make deploy-reducer
```

Once deployed, you should find the following resources in your GCP project:

- Cloud Functions
  - `controller`
  - `starter`
  - `splitter`
  - `mapper`
  - `combiner`
  - `shuffler`
  - `reducer`
- Cloud Pub/Sub topics
  - `mapreduce-controller`
  - `mapreduce-splitter`
  - `mapreduce-mapper`
  - `mapreduce-combiner`
  - `mapreduce-shuffler`
  - `mapreduce-reducer`
- Cloud Memorystore for Redis
  - N instances, where N is the number of reducers you specified in the `.env` file
- Serverless VPC Access connector
  - `mapreduce-connector`

Here is a diagram of how the architecture is linked when using 5 Redis instances and therefore 5 reduce jobs:

![Architecture Diagram](./architecture.png)

Similarly, you can delete the functions and Redis instances using the following commands:

```bash
# Delete all the functions and Redis instances
make remove

# OR individually

# Delete the Redis instances
make remove-redis
# Delete the controller function
make remove-controller
# Delete the starter function
make remove-starter
# Delete the splitter function
make remove-splitter
# Delete the mapper function
make remove-mapper
# Delete the combiner function
make remove-combiner
# Delete the shuffler function
make remove-shuffler
# Delete the reducer function
make remove-reducer
```

**Note:** The deployment scripts are written in Bash and were tested on Linux and macOS. They may not work on Windows.

**Warning:** The deployment scripts create several services in GCP that you will be charged for monthly - namely the Redis instances and the serverless VPC connector. Make sure you delete these once you are done running the project.

### Running the Anagram MapReduce

To start the MapReduce running on a dataset, you first need to ensure that the data is in a Google Cloud Storage bucket.
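The upload method is not prescribed here, but one straightforward way to get the books into the input bucket is `gsutil`. This sketch assumes the Project Gutenberg books are local `.txt` files in a `books/` directory and uses the input bucket created during deployment:

```bash
# Copy the dataset into the input bucket in parallel (-m).
# Assumes local .txt files in ./books and the bucket created earlier.
gsutil -m cp ./books/*.txt gs://$GCP_PROJECT-input/
```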
There are two ways of starting the MapReduce:

#### Option 1

Use the Make command, which runs the script `./scripts/start-anagram-mapreduce.sh` and asks you to provide the names of the input and output buckets. This is the easiest way to start the MapReduce.

```bash
make start
```

#### Option 2

Find the URI you need to call the starter function with using the following command (replace `$GCP_REGION` and `$GCP_PROJECT` with the region and project you deployed to):

```bash
gcloud functions describe starter --region=$GCP_REGION --project=$GCP_PROJECT --format="value(serviceConfig.uri)"
```

Then call the starter function using the following command (replace `$URI`, `$INPUT_BUCKET` and `$OUTPUT_BUCKET` with the values you know):

```bash
curl -X GET "$URI?input_bucket=$INPUT_BUCKET&output_bucket=$OUTPUT_BUCKET" | jq
```

#### Results

Upon successfully starting the MapReduce, you will receive a response similar to:

```json
{
  "responseCode": 200,
  "message": "MapReduce started successfully - results will be stored in: serverless-mapreduce-output"
}
```

To check whether the MapReduce has finished, you can use the following command (where `$OUTPUT_BUCKET` is the name of the bucket you provided as the output bucket):

```bash
gsutil ls gs://$OUTPUT_BUCKET | grep -E "anagrams-part-[0-9]+.txt"
```

If you deployed the project with 5 reducer jobs, you should see 5 files in the output bucket. In that case, if you see fewer than 5 files, the reduce phase is still running. To retrieve the files, you can use the following command (where `$OUTPUT_BUCKET` is the name of the bucket you provided as the output bucket):

```bash
gsutil -m cp -R gs://$OUTPUT_BUCKET .
```

In each file, every line has the format `sorted_word: word1 word2 word3 ... wordN`, where {word1, word2, word3, ..., wordN} is a set of words that are anagrams of one another. For example, a line could be `aet: ate eat tea`.

### Tests

Unit tests exist that allow you to test the full functionality of each cloud function. All of these tests, including the creation and removal of the Docker containers used by them, can be run with:

```bash
make test
```

**Note:** Sometimes the splitter tests fail because the storage emulator is not ready when the tests run. If this happens, just remove the containers using `make teardown-test` and run the tests again. For more information on the tests, see below.

Before running any tests, you will need to run several Docker images that mock the GCP cloud services used in the project. This includes an official Google image, `gcr.io/google.com/cloudsdktool/cloud-sdk:latest`, which I use to mock the Pub/Sub service, and an open-source image, `oittaa/gcp-storage-emulator`, found [here](https://hub.docker.com/r/oittaa/gcp-storage-emulator), to mock storage buckets. I also use the Redis image `redis-stack:latest` to create a local Redis instance running in a Docker container.

If you have Docker installed, you can run the following command to create the containers:

```bash
make setup-test
```

Alternatively, you can create a subset of the containers by running any of the following commands:

```bash
make create-pubsub-emulator
make create-storage-emulator
make create-local-redis
```

To remove the containers, run:

```bash
make teardown-test
```

Or remove a subset of the containers by running any of the following commands:

```bash
make remove-pubsub-emulator
make remove-storage-emulator
make remove-local-redis
```

Unit tests exist for all the functions written in the project and mock their actual behaviour in GCP. I used these throughout the development of the MapReduce to ensure that no changes I made to my code caused anything to break. To run just the unit tests:

```bash
make test-unit
```

A script also exists for checking the test coverage of each package (note that you will need a newer version of Bash that supports `declare -A`; I installed it using Homebrew). This can be run using:

```bash
make test-coverage
```

### 22COC105 Output

A bucket exists in GCP that contains the output files with all the anagrams from the 100 books, as required by the 22COC105 coursework. The bucket is readable from the internet, so you can download these files with:

```bash
gsutil -m cp -R gs://serverless-mapreduce-output/ .
```
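As a quick sanity check on the download, you can count and preview the anagram sets; this assumes the files follow the `anagrams-part-N.txt` naming described in [Results](#results):

```bash
# Count the anagram sets in each part and in total (one set per line).
wc -l serverless-mapreduce-output/anagrams-part-*.txt
# Preview a few lines in the "sorted_word: word1 ... wordN" format.
head -n 3 serverless-mapreduce-output/anagrams-part-*.txt
```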
