julia-for-data-science

Category: Big Data
Development tool: Julia
File size: 0KB
Downloads: 0
Upload date: 2018-09-12 05:57:26
Uploader: sh-1993
Description: Exploring the Julia programming language for Data Science and Machine Learning.

File list:
Full_Code.jl (3485, 2018-09-11)
hiJulia_files/ (0, 2018-09-11)
hiJulia_files/figure-markdown_github/ (0, 2018-09-11)
hiJulia_files/figure-markdown_github/box.png (31235, 2018-09-11)
hiJulia_files/figure-markdown_github/cluster.png (31775, 2018-09-11)
hiJulia_files/figure-markdown_github/density.png (32219, 2018-09-11)
hiJulia_files/figure-markdown_github/hex.png (82588, 2018-09-11)
hiJulia_files/figure-markdown_github/line.png (25765, 2018-09-11)
home_data-train .csv (2162332, 2018-09-11)

Julia for Data Science
================

[Exploring JULIA's power in doing Data Science ...](https://github.com/MNoorFawi/julia-for-data-science)
-------------------------------------------------

#### What Is Julia, and Why Use It?

###### Julia is a high-level, general-purpose, dynamic programming language that was originally designed to address the needs of high-performance numerical analysis and computational science, without the typical need for separate compilation to be fast.

###### Julia is FAST. It was designed from the beginning for high performance, and GPU acceleration is available through packages. It is general, dynamic and fast-growing, and its community is expanding quickly.

##### One beautiful thing about Julia is that its syntax resembles that of R, and as an R lover, I have begun to love Julia ...

###### N.B. We'll be using Julia version 0.6.4.

Here we're going to explore some of what can be done in Data Science with Julia.

First we need to install and load some Julia packages.

``` julia
Pkg.add("DataFrames")
Pkg.add("Query")
# .....
using DataFrames, Query, Knet, Gadfly, Cairo, Clustering, RDatasets
```

### Exploring, Manipulating and Visualizing Data

Julia has a data structure called [**DataFrame**](http://juliadata.github.io/DataFrames.jl/stable/index.html), provided by the DataFrames package, which is similar to R's data frame and pandas' DataFrame. It behaves much like data frames in other languages, with methods for slicing, changing, reshaping, etc. The [**Query**](http://www.queryverse.org/Query.jl/stable/) package can be used to manipulate a DataFrame with **SQL**-like syntax. Query is very beautiful and it reminds me of R's **dplyr**.

Julia also has [**Gadfly**](http://gadflyjl.org/stable/index.html), a data visualization package modeled on **R's ggplot2**.

###### We will use some R datasets to explore them.

Creating a DataFrame ...

``` julia
Uefa_Goalscorers = DataFrame(Player = ["Cristiano Ronaldo", "Lionel Messi", "Raul Gonzalez"],
                             Goals = [120, 100, 71])
Uefa_Goalscorers
# │ Row │ Player            │ Goals │
# ├─────┼───────────────────┼───────┤
# │ 1   │ Cristiano Ronaldo │ 120   │
# │ 2   │ Lionel Messi      │ 100   │
# │ 3   │ Raul Gonzalez     │ 71    │

describe(Uefa_Goalscorers)
# 2×8 DataFrames.DataFrame
# │ Row │ variable │ mean │ min               │ median │ max           │ nunique │ nmissing │ eltype │
# ├─────┼──────────┼──────┼───────────────────┼────────┼───────────────┼─────────┼──────────┼────────┤
# │ 1   │ Player   │      │ Cristiano Ronaldo │        │ Raul Gonzalez │ 3       │          │ String │
# │ 2   │ Goals    │ 97.0 │ 71                │ 100.0  │ 120           │         │          │ Int64  │
```

Using some of the DataFrame functionality ...

``` julia
Movies = dataset("ggplot2", "movies")
size(Movies)
# (58788, 24)
names(Movies)
# :Title
# :Year
# :Length
# ...
# :Romance
# :Short

# Drop the columns we don't need (delete! modifies Movies in place)
Movies2 = delete!(Movies, [:Budget, :Length, :R1, :R2, :R3, :R4, :R5, :R6, :R7, :R8, :R9, :R10, :MPAA])
head(Movies2, 3)
# 3×11 DataFrames.DataFrame
# │ Row │ Title                  │ Year │ Rating │ Votes │ Action │ Animation │ Comedy │ Drama │ Documentary │ Romance │ Short │
# ├─────┼────────────────────────┼──────┼────────┼───────┼────────┼───────────┼────────┼───────┼─────────────┼─────────┼───────┤
# │ 1   │ $                      │ 1971 │ 6.4    │ 348   │ 0      │ 0         │ 1      │ 1     │ 0           │ 0       │ 0     │
# │ 2   │ $1000 a Touchdown      │ 1939 │ 6.0    │ 20    │ 0      │ 0         │ 1      │ 0     │ 0           │ 0       │ 0     │
# │ 3   │ $21 a Day Once a Month │ 1941 │ 8.2    │ 5     │ 0      │ 1         │ 0      │ 0     │ 0           │ 0       │ 1     │

# Reshape from wide to long: one row per (movie, genre) pair
Movies3 = stack(Movies2, 5:11)
head(Movies3, 3)
# 3×6 DataFrames.DataFrame
# │ Row │ variable │ value │ Title                  │ Year │ Rating │ Votes │
# ├─────┼──────────┼───────┼────────────────────────┼──────┼────────┼───────┤
# │ 1   │ Action   │ 0     │ $                      │ 1971 │ 6.4    │ 348   │
# │ 2   │ Action   │ 0     │ $1000 a Touchdown      │ 1939 │ 6.0    │ 20    │
# │ 3   │ Action   │ 0     │ $21 a Day Once a Month │ 1941 │ 8.2    │ 5     │

tail(Movies3, 3)
# 3×6 DataFrames.DataFrame
# │ Row │ variable │ value │ Title                   │ Year │ Rating │ Votes │
# ├─────┼──────────┼───────┼─────────────────────────┼──────┼────────┼───────┤
# │ 1   │ Short    │ 0     │ www.hellssoapopera.com  │ 1999 │ 6.6    │ 5     │
# │ 2   │ Short    │ 0     │ xXx                     │ 2002 │ 5.5    │ 18514 │
# │ 3   │ Short    │ 0     │ xXx: State of the Union │ 2005 │ 3.9    │ 1584  │

categorical!(Movies3, :variable)
rename!(Movies3, :variable => :Genre)
sort!(Movies3, :Year)
# 411516×6 DataFrames.DataFrame
# │ Row    │ Genre        │ value │ Title                           │ Year │ Rating │ Votes │
# ├────────┼──────────────┼───────┼─────────────────────────────────┼──────┼────────┼───────┤
# │ 1      │ :Action      │ 0     │ Blacksmith Scene                │ 1893 │ 7.0    │ 90    │
# │ 2      │ :Animation   │ 0     │ Blacksmith Scene                │ 1893 │ 7.0    │ 90    │
# │ 3      │ :Comedy      │ 0     │ Blacksmith Scene                │ 1893 │ 7.0    │ 90    │

# Mean rating and number of movies per genre, counting only rows where the genre flag is set
by(Movies3[Movies3[:value] .> 0, :], :Genre) do df
    DataFrame(MeanRating = mean(df[:Rating]), N = size(df, 1))
end
# 7×3 DataFrames.DataFrame
# │ Row │ Genre        │ MeanRating │ N     │
# ├─────┼──────────────┼────────────┼───────┤
# │ 1   │ :Short       │ 6.48142    │ 9458  │
# │ 2   │ :Documentary │ 6.65058    │ 3472  │
# │ 3   │ :Comedy      │ 5.95549    │ 17271 │
# │ 4   │ :Drama       │ 6.15368    │ 21811 │
# │ 5   │ :Animation   │ 6.58369    │ 3690  │
# │ 6   │ :Action      │ 5.29202    │ 4688  │
# │ 7   │ :Romance     │ 6.164      │ 4744  │
```

Using the **Query** package ...

``` julia
# Keep only the rows where the movie belongs to the genre (value > 0)
Movies4 = @from i in Movies3 begin
    @where i.value > 0
    @select i
    @collect DataFrame
end

# Group data by Year and get number of movies per year
Movies4 = @from i in Movies4 begin
    @group i by i.Year into g
    @select {Year = g.key, Count = length(g)}
    @collect DataFrame
end

size(Movies4)
# (113, 2)
head(Movies4, 3)
# 3×2 DataFrames.DataFrame
# │ Row │ Year │ Count │
# ├─────┼──────┼───────┤
# │ 1   │ 1893 │ 1     │
# │ 2   │ 1894 │ 14    │
# │ 3   │ 1895 │ 5     │
```

As we can see, Julia's DataFrame combined with the Query package is very powerful ...

Now let's draw some plots using **Gadfly**.

``` julia
plot(Movies4, x = :Year, y = :Count, Geom.line)
```

![](hiJulia_files/figure-markdown_github/line.png)

It really looks like **ggplot2**!

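For readers who want to see what a grouped summary such as the `by(...)` call above does conceptually, here is a minimal split-apply-combine sketch in plain Julia with no packages. The function name `group_summary` and the toy genres and ratings are assumptions made up for illustration; this is not how DataFrames implements grouping internally.

``` julia
# Split-apply-combine by hand: for each group, compute (mean, count).
# `group_summary` is a hypothetical helper, not part of DataFrames or Query.
function group_summary(groups::Vector{String}, vals::Vector{Float64})
    sums = Dict{String,Float64}()
    counts = Dict{String,Int}()
    # Split: walk the rows once, accumulating a running sum and count per group
    for (g, v) in zip(groups, vals)
        sums[g] = get(sums, g, 0.0) + v
        counts[g] = get(counts, g, 0) + 1
    end
    # Apply + combine: turn the accumulators into one (mean, count) pair per group
    result = Dict{String,Tuple{Float64,Int}}()
    for g in keys(sums)
        result[g] = (sums[g] / counts[g], counts[g])
    end
    return result
end

# Toy data, invented for this example (not taken from the movies dataset)
genres  = ["Comedy", "Drama", "Comedy", "Short", "Drama", "Comedy"]
ratings = [6.0, 7.0, 5.0, 8.0, 6.5, 7.0]

stats = group_summary(genres, ratings)
for g in sort(collect(keys(stats)))
    println(g, " => ", stats[g])
end
```

In practice `by` or a Query `@group` clause is the better choice, since both handle multiple columns and return a DataFrame directly.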
Let's do some more visualizations ...

``` julia
diamonds = dataset("ggplot2", "diamonds")
plot(diamonds, x = :Price, y = :Carat, Geom.hexbin)
```

![](hiJulia_files/figure-markdown_github/hex.png)

``` julia
plot(diamonds, x = :Price, color = :Cut, Geom.density)
```

![](hiJulia_files/figure-markdown_github/density.png)

``` julia
plot(diamonds, x = :Cut, y = :Price, Geom.boxplot)
```

![](hiJulia_files/figure-markdown_github/box.png)

And we can do even more with Gadfly ...

Now that we have looked at data exploration, transformation and visualization, let's look at another important aspect of Data Science: **Machine Learning**.

Julia has many projects and packages for Machine Learning and Deep Learning. Here we're going to talk about some of them ...

First, [**Knet**](https://github.com/denizyuret/Knet.jl). In Knet every model consists of a predict function and a loss function that the model tries to minimize. What I like about Knet is that it makes me specify the model/algorithm equation first, which helps me understand the algorithms better.

Let's build a linear regression model using House Price data.

``` julia
## Get the data
House = readtable("home_data-train .csv", separator = ',', header = false)
# remove the first two columns as they're not important
delete!(House, [:x1, :x2])
# examine correlation to select variables
cor(convert(Array, House))

# Get x, y
excluded = [3, 7, 11, 14, 15, 16, 17, 18, 19, 20]
unnecessary = [Symbol("x$i") for i in excluded]
x = House[setdiff(names(House), unnecessary)]
# Convert x to a matrix (variables in rows, observations in columns)
x = convert(Array, x)'
# Scale x
x = x ./ sum(x, 1)
y = House[:x3]'
y = log10.(y) # log y

## TRAIN THE MODEL
using Knet
predict(w, x) = w[1]*x .+ w[2] # linear regression equation ax + b
loss(w, x, y) = mean(abs2, y-predict(w, x)) # mean squared error
lossgradient = grad(loss)
# lossgradient returns dw, the gradient of the loss function
function train(w, data; lr=.1) # lr learning rate
    for (x,y) in data
        dw = lossgradient(w, x, y)
        for i in 1:length(w)
            w[i] -= lr * dw[i]
        end
    end
    return w
end

w = Any[0.1 * randn(1, 9), 0.0] # 9 variables
for i = 1:10; train(w, [(x, y)]); println(loss(w, x, y)); end
# 15.747867259203675
# 7.728037219389701
# .....
# 0.19558043796561383
# 0.1697502328566464
```

Look how the loss shrinks throughout the learning process. It can look a little confusing at first, but believe me, it's straightforward. Have a look at the Knet project tutorials and it will become clearer.

Now let's evaluate the model by comparing the predicted values with the actual ones.

``` julia
## First let's look at every variable coefficient and the intercept
w[1]
# 1×9 Array{Float64,2}:
# -0.0188016 -0.148975 0.810052 0.0934263 0.149106 0.100036 -0.0526069 0.430641 2.39719
w[2]
# 3.6645523063680914

## Now the actual y
y
# 5.34616 5.73078 5.25527 5.78104 5.70757 ...
yhat = w[1] * x .+ w[2]
# 5.53536 5.38548 5.77452 5.41222 5.50971 ...

#### N.B. these values are on a log10 scale; to get back to actual prices, compute 10 .^ y
```

The model looks good ...

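To make explicit what `grad(loss)` provides in the training loop above, here is a minimal sketch of the same idea in plain Julia, with the gradient of the mean squared error written out by hand. It is an illustration under stated assumptions, not Knet's API: the separate intercept argument `b`, the `epochs` keyword and the toy data generated from y = 2x + 1 are all made up for this example.

``` julia
# Model: yhat = w*x + b, with w a 1×d row of weights, x a d×n feature matrix, b a scalar.
predict(w, b, x) = w * x .+ b

# Mean squared error between the targets y (1×n) and the predictions.
loss(w, b, x, y) = sum(abs2, y .- predict(w, b, x)) / length(y)

# Hand-derived gradient of the loss (the part Knet's `grad` automates):
#   dL/dw = -(2/n) * (y - yhat) * x'
#   dL/db = -(2/n) * sum(y - yhat)
function lossgradient(w, b, x, y)
    n = length(y)
    r = y .- predict(w, b, x)      # residuals, 1×n
    dw = -(2 / n) .* (r * x')      # 1×d
    db = -(2 / n) * sum(r)
    return dw, db
end

# Plain gradient descent: step against the gradient, `epochs` times.
function train(w, b, x, y; lr = 0.1, epochs = 500)
    for epoch in 1:epochs
        dw, db = lossgradient(w, b, x, y)
        w = w .- lr .* dw
        b = b - lr * db
    end
    return w, b
end

# Toy data (assumption): one feature, four observations, generated from y = 2x + 1
x = [0.0 1.0 2.0 3.0]
y = 2.0 .* x .+ 1.0

w, b = train(zeros(1, 1), 0.0, x, y)
println("w = ", w[1], ", b = ", b, ", loss = ", loss(w, b, x, y))
```

On this noise-free toy problem the fitted weight and intercept should end up very close to 2 and 1. With real data such as the house prices, the loss levels off at a positive value instead of going to zero.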
###### For more information on Linear Regression on House Pricing Data, Visit

Now let's look at **CLUSTERING** and how it can be done with Julia.

``` julia
Animals = dataset("MASS", "Animals")
using Clustering
# kmeans expects a features × observations matrix, so transpose the two numeric columns
feature_matrix = permutedims(convert(Array{Float32}, Animals[:, 2:3]), [2, 1])
model = kmeans(feature_matrix, 3) # 3 clusters

## Plotting clusters
plot(Animals, x = :Brain, y = :Body, color = categorical(model.assignments),
     label = :Species, Geom.point, Geom.label, Scale.x_log10, Scale.y_log10)
```

![](hiJulia_files/figure-markdown_github/cluster.png)

This is straightforward ...

There are many other packages that can be used for Machine Learning with Julia, such as [**DecisionTree**](https://github.com/bensadeghi/DecisionTree.jl), [**ScikitLearn.jl**](https://scikitlearnjl.readthedocs.io/en/latest/), [**Flux**](https://github.com/FluxML/Flux.jl) and others.

Now we have looked at a little bit of what Julia can do in Data Science. Julia is a great language: its syntax is beautiful and it is very powerful for math operations. For numerical code it can come close to the speed of C++. And above all, it is easy to learn, as it resembles R and Python in many aspects.

Finally, **JULIA** is a great tool to add to your Data Science toolkit ...
