link-prediction-master

Category: Artificial Intelligence / Neural Networks / Deep Learning
Development tool: Python
File size: 22KB
Downloads: 0
Upload date: 2019-05-14 21:37:47
Uploader: phupham
Description: link prediction model for network analysis

File list:
GraphSimilarityFeatures.scala (3984, 2012-07-31)
PersonalizedPageRank.scala (3769, 2012-07-31)
PersonalizedPageRankInScalding.scala (3559, 2012-07-31)
PropagationScore.scala (5157, 2012-07-31)
Scorer.scala (2558, 2012-07-31)
calculate_dataset_distributions.rb (727, 2012-07-31)
find_needy_nodes.rb (1045, 2012-07-31)
predict_links.py (2808, 2012-07-31)

A couple weeks ago, Facebook launched a [link prediction contest on Kaggle](http://www.kaggle.com/c/FacebookRecruiting/), with the goal of recommending missing edges in a social graph. [I love investigating social networks](http://blog.echen.me/2011/09/07/information-transmission-in-a-social-network-dissecting-the-spread-of-a-quora-post/), so I dug around a little, and since I did well enough to score one of the coveted prizes, I'll share my approach here.

(For a bit of background, the contest provided a training dataset of edges and a test set of nodes, and contestants were asked to predict missing outbound edges on the test set, using [mean average precision](http://www.kaggle.com/c/FacebookRecruiting/details/Evaluation) as the evaluation metric.)

# Exploring

What does the network actually look like? I wanted to play around with the data a bit first, to get a rough feel, so I made [an app](http://link-prediction.herokuapp.com/network) to interact with the local follow network around each node.

[![1 Untrimmed Network](https://dl.dropbox.com/u/10506/blog/kaggle-fb/1_untrimmed.png)](http://link-prediction.herokuapp.com/network)

(Go ahead, click on the picture to [play with the app yourself](http://link-prediction.herokuapp.com/network). It's pretty fun.)

The node in black is a selected central node from the training set, and we perform a breadth-first walk of the graph out to a maximum distance of 3. Nodes are sized according to their distance from the center, and colored according to a chosen metric (a personalized PageRank in this case, but more on this later). We can see that the central node is friends with three other users, two of whom (on the top-left and bottom) have fairly large, disjoint networks.
There are a bunch of dangling nodes (nodes at distance 3 with only one connection to the rest of the local network) that aren't contributing much to the picture, though, so let's remove these to reveal the core structure:

[![1 Trimmed Network](https://dl.dropbox.com/u/10506/blog/kaggle-fb/1_network.png)](http://link-prediction.herokuapp.com/network)

Since the default view doesn't encode the distinction between following and follower relationships, we can mouse over each node to see who it follows and who it's followed by. Here, for example, is the following/follower network of one of the central node's friends:

[![1 - Friend1](https://dl.dropbox.com/u/10506/blog/kaggle-fb/1_friend1.png)](https://dl.dropbox.com/u/10506/blog/kaggle-fb/1_friend1.png)

The moused-over node is highlighted in black, its friends (users who both follow the node and are followed back in turn) are colored in purple, its followees are in teal, and its followers are in orange. We can see that this node has a good mix of all three types, and also that it too is friends with one of the central node's friends ([triadic closure](http://en.wikipedia.org/wiki/Triadic_closure), unite).

Here's another following/follower network, this time of the central node's friend at the bottom:

[![1 - Friend2](https://dl.dropbox.com/u/10506/blog/kaggle-fb/1_friend2.png)](https://dl.dropbox.com/u/10506/blog/kaggle-fb/1_friend2.png)

Interestingly, while the first friend had several followers-only relationships (in orange), the second friend has none. (Which suggests, perhaps, a new type of follow-hungry feature...)
And here's one more node, a little farther out, which has nothing but followers (a celebrity, perhaps?!):

[![1 - Celebrity](https://dl.dropbox.com/u/10506/blog/kaggle-fb/1_celebrity.png)](https://dl.dropbox.com/u/10506/blog/kaggle-fb/1_celebrity.png)

## The Quiet One

Let's take a look at another graph, one whose local network is a little smaller:

[![4 Network](https://dl.dropbox.com/u/10506/blog/kaggle-fb/4_network.png)](https://dl.dropbox.com/u/10506/blog/kaggle-fb/4_network.png)

## A Social Butterfly

And one more, one whose local network is a little larger:

[![2 Network](https://dl.dropbox.com/u/10506/blog/kaggle-fb/2_network.png)](https://dl.dropbox.com/u/10506/blog/kaggle-fb/2_network.png)

[![2 Network - Friend](https://dl.dropbox.com/u/10506/blog/kaggle-fb/2_friend.png)](https://dl.dropbox.com/u/10506/blog/kaggle-fb/2_friend.png)

Again, I encourage everyone to play around with the app [here](http://link-prediction.herokuapp.com/network), and we'll revisit the question of coloring each node later.

# Distributions

I also wanted to take a more quantitative look at the graph. Here's the distribution of the number of followers of each node in the training set (cut off at 50 followers to better see the graph -- the maximum number of followers is 552), as well as the number of users each node is following (again, cut off at 50 -- the maximum here is 1566):

[![Training](https://dl.dropbox.com/u/10506/blog/kaggle-fb/training_dist.png)](https://dl.dropbox.com/u/10506/blog/kaggle-fb/training_dist.png)

(Nothing terribly surprising here, but that alone is good to verify. Note that based on these counts alone, and the lack of Bieberish nodes, the unknown social network does seem to consist of more local connections than a network like Twitter. For people tempted to mutter about power laws, I'll hold you off with the bitter coldness of [baby Gauss's tears](http://cscs.umich.edu/~crshalizi/weblog/491.html).)
Similarly, here are the same two graphs, but limited to the nodes in the test set alone:

[![Test](https://dl.dropbox.com/u/10506/blog/kaggle-fb/test_dist.png)](https://dl.dropbox.com/u/10506/blog/kaggle-fb/test_dist.png)

(Notice that there are relatively more test set users with 0 followees than in the full training set, and relatively fewer test set users with 0 followers. This information could be used to better simulate a validation set for model selection, though I didn't end up doing this myself.)

# Preliminary Probes

Finally, let's move on to the models themselves. In order to quickly get up and running on a couple of prediction algorithms, I started with some unsupervised approaches. For example, after building a new validation set* to test performance offline, I tried:

* Recommending users who follow you (but whom you don't follow in return)
* Recommending users similar to you (representing users as sets of their followers, and using cosine similarity and Jaccard similarity as the similarity metric)
* Recommending users based on a personalized PageRank score
* Recommending users that the people you follow also follow

And so on, combining the votes of these algorithms in a fairly ad hoc way (e.g., by taking the majority vote or by ordering by the number of followers). This worked quite well, actually, but I'd been planning from the beginning to move on to a more machine-learned, model-based approach, so I did that next.

\*My validation set was formed by deleting random edges from the full training set. A slightly better approach, as mentioned above, might have been to more accurately simulate the distribution of the official test set, but I didn't end up trying this out myself.
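The edge-deletion validation scheme, scored offline with the contest's mean-average-precision metric, can be sketched in a few lines of Python. This is a simplified stand-in for what I actually ran; the function names and the plain edge-list representation are my own.

```python
import random

def make_validation_split(edges, holdout_frac=0.1, seed=42):
    """Delete a random fraction of edges from the training set; the deleted
    edges become the held-out set we try to recover."""
    rng = random.Random(seed)
    edges = list(edges)
    rng.shuffle(edges)
    n_holdout = int(len(edges) * holdout_frac)
    return edges[n_holdout:], edges[:n_holdout]

def average_precision(predicted, actual, k=10):
    """Average precision at k for one source node's ranked recommendations."""
    actual = set(actual)
    hits, total = 0, 0.0
    for i, node in enumerate(predicted[:k], start=1):
        if node in actual:
            hits += 1
            total += hits / i  # precision at rank i
    return total / min(len(actual), k) if actual else 0.0

def mean_average_precision(predictions, truths, k=10):
    """Mean of average precision over all test-set source nodes."""
    return sum(average_precision(p, t, k)
               for p, t in zip(predictions, truths)) / len(predictions)
```

A perfect ranking scores 1.0 for its node, and true edges buried low in the list contribute less, which is why getting the candidate ordering right matters so much below.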
# Candidate Selection

In order to run a machine learning algorithm to recommend edges (which would take two nodes, a source and a candidate destination, and generate a score measuring the likelihood that the source would follow the destination), it was necessary to prune the set of candidates to run the algorithm on. I used two approaches for this filtering step, both based on random walks on the graph.

## Personalized PageRank

The first approach was to calculate a personalized PageRank around each source node. Briefly, a personalized PageRank is just like standard PageRank, except that when randomly teleporting to a new node, the surfer always teleports back to the given source node being personalized around (rather than to a node chosen uniformly at random, as in the standard PageRank algorithm). That is, the random surfer in the personalized PageRank model works as follows:

* He starts at the source node X that we want to calculate a personalized PageRank around.
* At step $i$: with probability $p$, the surfer moves to a neighboring node chosen uniformly at random; with probability $1-p$, the surfer instead teleports back to the original source node X.
* The limiting probability that the surfer is at node N is then the personalized PageRank score of node N around X.

Here's some Scala code that computes (approximate) personalized PageRank scores and takes the highest-scoring nodes as the candidates to feed into the machine learning model:

```scala
import scala.collection.mutable.Map

/**
 * Calculate a personalized PageRank around the given user, and return a list of the
 * nodes with the highest personalized PageRank scores.
 *
 * @return A list of (node, probability of landing at this node after running a personalized
 *         PageRank for K iterations) pairs.
 */
def pageRank(user: Int): List[(Int, Double)] = {
  // This map holds the probability of landing at each node, up to the current iteration.
  val probs = Map[Int, Double]()
  probs(user) = 1 // We start at this user.

  val pageRankProbs = pageRankHelper(user, probs, NumPagerankIterations)
  pageRankProbs.toList
               .sortBy { -_._2 }
               .filter { case (node, score) =>
                 !getFollowings(user).contains(node) && node != user
               }
               .take(MaxNodesToKeep)
}

/**
 * Simulates running a personalized PageRank for one iteration.
 *
 * Parameters:
 *   start - the start node to calculate the personalized PageRank around
 *   probs - a map from nodes to the probability of being at that node at the start of
 *           the current iteration
 *   numIterations - the number of iterations remaining
 *   alpha - with probability alpha, we follow a neighbor; with probability 1 - alpha,
 *           we teleport back to the start node
 *
 * @return A map of node -> probability of landing at that node after the specified
 *         number of iterations.
 */
def pageRankHelper(start: Int, probs: Map[Int, Double], numIterations: Int,
                   alpha: Double = 0.5): Map[Int, Double] = {
  if (numIterations <= 0) {
    probs
  } else {
    // This map holds the updated set of probabilities, after the current iteration.
    val probsPropagated = Map[Int, Double]()

    // With probability 1 - alpha, we teleport back to the start node.
    probsPropagated(start) = 1 - alpha

    // Propagate the previous probabilities...
    probs.foreach { case (node, prob) =>
      val forwards = getFollowings(node)
      val backwards = getFollowers(node)

      // With probability alpha, we move to a follower or followee...
      // ...and each node distributes its current probability equally among its neighbors.
      val probToPropagate = alpha * prob / (forwards.size + backwards.size)
      (forwards.toList ++ backwards.toList).foreach { neighbor =>
        if (!probsPropagated.contains(neighbor)) {
          probsPropagated(neighbor) = 0
        }
        probsPropagated(neighbor) += probToPropagate
      }
    }

    pageRankHelper(start, probsPropagated, numIterations - 1, alpha)
  }
}
```

## Propagation Score

Another (similar) approach I used, based on [a proposal by another contestant on the Kaggle forums](http://www.kaggle.com/c/FacebookRecruiting/forums/t/2082/0-711-is-the-new-0), works as follows:

* Start at a specified user node and give it some score.
* In the first iteration, this user propagates its score equally to its neighbors.
* In the second iteration, each user duplicates and keeps half of its score S. It then propagates S equally to its neighbors.
* In subsequent iterations, the process is repeated, except that neighbors reached via a backwards link don't duplicate and keep half of their score. (The idea being that we want the score to reach followees and not followers.)

Here's some Scala code to calculate these propagation scores:

```scala
/**
 * Calculate propagation scores around the current user.
 *
 * In the first propagation round, we
 *
 * - Give the starting node N an initial score S.
 * - Propagate the score equally to each of N's neighbors (followers and followings).
 * - Each first-level neighbor then duplicates and keeps half of its score
 *   and then propagates the original again to its neighbors.
 *
 * In further rounds, neighbors then repeat the process, except that neighbors traveled to
 * via a backwards/follower link don't keep half of their score.
 *
 * @return a sorted list of (node, propagation score) pairs.
 */
def propagate(user: Int): List[(Int, Double)] = {
  val scores = Map[Int, Double]()

  // We propagate the score equally to all neighbors.
  val scoreToPropagate = 1.0 / (getFollowings(user).size + getFollowers(user).size)

  (getFollowings(user).toList ++ getFollowers(user).toList).foreach { x =>
    // Propagate the score...
    continuePropagation(scores, x, scoreToPropagate, 1)
    // ...and make sure each neighbor keeps half of it for itself.
    scores(x) = scores.getOrElse(x, 0.0) + scoreToPropagate / 2
  }

  scores.toList.sortBy { -_._2 }
        .filter { nodeAndScore =>
          val node = nodeAndScore._1
          !getFollowings(user).contains(node) && node != user
        }
        .take(MaxNodesToKeep)
}

/**
 * In further rounds, neighbors repeat the process above, except that neighbors traveled to
 * via a backwards/follower link don't keep half of their score.
 */
def continuePropagation(scores: Map[Int, Double], user: Int, score: Double,
                        currIteration: Int): Unit = {
  if (currIteration < NumIterations && score > 0) {
    val scoreToPropagate = score / (getFollowings(user).size + getFollowers(user).size)

    getFollowings(user).foreach { x =>
      // Propagate the score...
      continuePropagation(scores, x, scoreToPropagate, currIteration + 1)
      // ...and make sure each followee keeps half of it for itself.
      scores(x) = scores.getOrElse(x, 0.0) + scoreToPropagate / 2
    }

    getFollowers(user).foreach { x =>
      // Propagate the score...
      continuePropagation(scores, x, scoreToPropagate, currIteration + 1)
      // ...but nodes reached via backward links (except for the starting node's
      // immediate neighbors) don't keep any score for themselves.
    }
  }
}
```

I played around with tweaking some parameters in both approaches (e.g., weighting followers and followees differently), but the natural defaults (as used in the code above) ended up performing the best.

# Features

After pruning the set of candidate destination nodes to a more feasible level, I fed pairs of (source, destination) nodes into a machine learning model. From each pair, I extracted about 30 features in total. As mentioned above, one feature that worked quite well on its own was whether the destination node already follows the source.
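As a rough Python sketch (with hypothetical `followers`/`followings` lookup dicts standing in for the real graph accessors), the follows-back feature and a couple of simple count features for a (source, destination) pair might look like:

```python
def extract_features(source, dest, followers, followings):
    """Compute a few illustrative features for a candidate (source, dest) edge.
    `followers` and `followings` each map a node id to a set of node ids."""
    src_followers = followers.get(source, set())
    dst_followers = followers.get(dest, set())
    return {
        # Does the destination already follow the source?
        "follows_back": source in followings.get(dest, set()),
        # Raw popularity of the candidate destination.
        "dest_num_followers": len(dst_followers),
        # Overlap between the two nodes' follower sets.
        "common_followers": len(src_followers & dst_followers),
    }
```

The real feature set is wider (the similarity and random-walk features below), but each feature reduces to a number per candidate pair in exactly this way.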
I also used a wide set of similarity-based features, for example, the Jaccard similarity between the source and destination when both are represented as sets of their followers, when both are represented as sets of their followees, or when one is represented as a set of followers while the other is represented as a set of followees.

```scala
abstract class SimilarityMetric[T] {
  def apply(set1: Set[T], set2: Set[T]): Double
}

object JaccardSimilarity extends SimilarityMetric[Int] {

  /**
   * Returns the Jaccard similarity between two sets, 0 if both are empty.
   */
  def apply(set1: Set[Int], set2: Set[Int]): Double = {
    val union = (set1.union(set2)).size
    if (union == 0) {
      0
    } else {
      (set1 & set2).size.toFloat / union
    }
  }

}

object CosineSimilarity extends SimilarityMetric[Int] {

  /**
   * Returns the cosine similarity between two sets, 0 if both are empty.
   */
  def apply(set1: Set[Int], set2: Set[Int]): Double = {
    if (set1.size == 0 && set2.size == 0) {
      0
    } else {
      (set1 & set2).size.toFloat / math.sqrt(set1.size * set2.size)
    }
  }

}

// ************
// * FEATURES *
// ************

/**
 * Returns the similarity between user1 and user2 when both are represented as
 * sets of followers.
 */
def similarityByFollowers(user1: Int, user2: Int)
                         (implicit similarity: SimilarityMetric[Int]): Double = {
  similarity.apply(getFollowersWithout(user1, user2), getFollowersWithout(user2, user1))
}

// etc.
```

Along the same lines, I also computed a similarity score between the destination node and the source node's followees, and several variations thereof.

```scala
/**
 * Iterate over each of user1's followings, compute their similarity with user2
 * when both are represented as sets of followers, and return the sum of these
 * similarities.
 */
def followerBasedSimilarityToFollowing(user1: Int, user2: Int)
                                      (implicit similarity: SimilarityMetric[Int]): Double = {
  getFollowingsWithout(user1, user2)
    .map { similarityByFollowers(_, user2)(similarity) }
    .sum
}
```

Other features included the number of followers and followees of each node, the ratio of these, the personalized PageRank and propagation scores themselves, the number of followers in common, and triangle/closure-type features (e.g., whether the source node is friends with a node X who is in turn a friend of the destination node).

If I had had more time, I would probably have tried weighted and more regularized versions of some of these features as well (e.g., downweighting nodes with large numbers of followers when computing followee-based cosine similarity scores, or shrinking the scores of nodes we have little information about), but I didn't get a chance this time around.

# Feature Understanding

But what are these features actually *doing*? As much as I enjoy black-box models, I'm an insights guy, so I'm even more interested in what my models are doing and *why*. So let's use the same app I built before to take a look.

Here's the local network of node 317 (different from the node above), where each node is colored by its personalized PageRank (higher scores are in darker red):

[![317 - Personalized PageRank](https://dl.dropbox.com/u/10506/blog/kaggle-fb/317_propagation.png)](https://dl.dropbox.com/u/10506/blog/kaggle-fb/317_propagation.png)

If we look at the following vs. follower relationships of the central node (recall that purple is friends, teal is followings, orange is followers):

[![317 - Followings and Followers](https://dl.dropbox.com/u/10506/blog/kaggle-fb/317_following_followers.png)](https://dl.dropbox.com/u/10506/blog/kaggle-fb/317_following_followers.png)

...we can see that, as expected (because I double-weighted edges in my personalized PageRank calculation that represented both a following and a follower relationship), the darkest red nodes are those that are friends of the central node, while those in a following-only or follower-only relationship have a lower score.

Now how does my propagation score compare to personalized PageRank? Here I colored each node according to the log ratio of its propagation score and personalized PageRank:

[![317 - Log Ratio](https://dl.dropbox.com/u/ ... ...
