Title: | Interface for 'GraphFrames' |
---|---|
Description: | A 'sparklyr' <https://spark.rstudio.com/> extension that provides an R interface for 'GraphFrames' <https://graphframes.github.io/>. 'GraphFrames' is a package for 'Apache Spark' that provides a DataFrame-based API for working with graphs. Functionality includes motif finding and common graph algorithms, such as PageRank and Breadth-first search. |
Authors: | Kevin Kuo [aut, cre] |
Maintainer: | Kevin Kuo <[email protected]> |
License: | Apache License 2.0 | file LICENSE |
Version: | 0.1.2 |
Built: | 2024-10-24 04:14:00 UTC |
Source: | https://github.com/rstudio/graphframes |
Breadth-first search (BFS)
gf_bfs(x, from_expr, to_expr, max_path_length = 10, edge_filter = NULL, ...)
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
from_expr |
Spark SQL expression specifying valid starting vertices for the BFS. |
to_expr |
Spark SQL expression specifying valid target vertices for the BFS. |
max_path_length |
Limit on the length of paths. |
edge_filter |
Spark SQL expression specifying edges which may be used in the search. |
... |
Optional arguments, currently not used. |
## Not run:
g <- gf_friends(sc)
gf_bfs(g, from_expr = "name = 'Esther'", to_expr = "age < 32")
## End(Not run)
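To illustrate what a breadth-first search with a path-length limit does, the following is a minimal local sketch in base R. It does not use Spark or this package: `bfs_length` and the toy edge list are hypothetical, vertex sets stand in for the Spark SQL expressions, and edges are assumed to be followed in their stored direction.

```r
# Toy directed edge list; vertex names are made up for illustration.
edges <- data.frame(
  src = c("a", "b", "c", "a"),
  dst = c("b", "c", "d", "e"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of 'graphframes'): number of edges on the
# shortest directed path from any vertex in `from` to any vertex in `to`,
# or NA if no such path exists within `max_path_length` edges.
bfs_length <- function(edges, from, to, max_path_length = 10) {
  frontier <- from
  visited <- from
  for (len in 0:max_path_length) {
    if (any(frontier %in% to)) return(len)
    # Expand one level: all unvisited targets of edges leaving the frontier.
    nxt <- unique(edges$dst[edges$src %in% frontier])
    frontier <- setdiff(nxt, visited)
    if (length(frontier) == 0) break
    visited <- c(visited, frontier)
  }
  NA_integer_
}

bfs_length(edges, from = "a", to = "d")
```

Lowering `max_path_length` below the true path length makes the search give up and return `NA`, which mirrors the role of the limit in `gf_bfs()`.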
Cache the GraphFrame
gf_cache(x)
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
Returns a chain graph of the given size with Long ID type. The vertex IDs are 0, 1, ..., n-1, and the edges are (0, 1), (1, 2), ..., (n-2, n-1).
gf_chain(sc, n)
sc |
A Spark connection. |
n |
Size of the graph to return. |
## Not run:
gf_chain(sc, 5)
## End(Not run)
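The edge pattern of a chain graph can be reproduced locally in base R. This is only a sketch: `chain_edges` is a hypothetical helper, not part of the package, and it builds an ordinary data frame rather than a GraphFrame on Spark.

```r
# Hypothetical helper (not part of 'graphframes'): edge list of a chain graph
# on n vertices with IDs 0, 1, ..., n-1.
chain_edges <- function(n) {
  stopifnot(n >= 2)
  data.frame(src = 0:(n - 2), dst = 1:(n - 1))
}

chain_edges(5)
```

A chain on n vertices has n-1 edges, each joining consecutive IDs.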
Computes the connected component membership of each vertex and returns a DataFrame of vertex information with each vertex assigned a component ID.
gf_connected_components(x, broadcast_threshold = 1000000L, algorithm = c("graphframes", "graphx"), checkpoint_interval = 2L, ...)
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
broadcast_threshold |
Broadcast threshold in propagating component assignments. |
algorithm |
One of 'graphframes' or 'graphx'. |
checkpoint_interval |
Checkpoint interval in terms of number of iterations. |
... |
Optional arguments, currently not used. |
## Not run:
# checkpoint directory is required for gf_connected_components()
spark_set_checkpoint_dir(sc, tempdir())
g <- gf_friends(sc)
gf_connected_components(g)
## End(Not run)
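As a rough picture of what a component ID means, here is a small base-R sketch that labels each vertex with the smallest vertex ID in its component. It is not the algorithm Spark runs: `components` is a hypothetical helper, the data are made up, and edges are assumed to be treated as undirected.

```r
# Toy graph: vertices 1..6, with 6 isolated.
edges <- data.frame(src = c(1, 2, 4), dst = c(2, 3, 5))
vertices <- 1:6

# Hypothetical helper (not part of 'graphframes'): repeatedly push the smaller
# label across every edge until no label changes.
components <- function(vertices, edges) {
  label <- setNames(vertices, vertices)
  repeat {
    old <- label
    for (i in seq_len(nrow(edges))) {
      s <- as.character(edges$src[i])
      d <- as.character(edges$dst[i])
      label[c(s, d)] <- min(label[s], label[d])
    }
    if (identical(old, label)) break
  }
  label
}

components(vertices, edges)
```

Vertices that end up with the same label belong to the same component; an isolated vertex keeps its own ID.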
Degrees of vertices
gf_degrees(x)
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
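For intuition, degrees can be computed locally from an edge list in base R. This sketch uses made-up data and plain data frames, and assumes the usual definitions: out-degree counts edges leaving a vertex, in-degree counts edges arriving, and degree counts both.

```r
# Toy edge list using 'src' and 'dst' columns.
edges <- data.frame(
  src = c("a", "b", "b"),
  dst = c("b", "a", "c"),
  stringsAsFactors = FALSE
)

out_deg <- table(edges$src)               # edges leaving each vertex
in_deg  <- table(edges$dst)               # edges arriving at each vertex
deg     <- table(c(edges$src, edges$dst)) # every edge endpoint

deg
```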
Edges column names
gf_edge_columns(x)
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
Extract edges DataFrame
gf_edges(x)
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
Motif finding uses a simple Domain-Specific Language (DSL) for expressing structural queries. For example, gf_find(g, "(a)-[e]->(b); (b)-[e2]->(a)") will search for pairs of vertices a,b connected by edges in both directions. It will return a DataFrame of all such structures in the graph, with columns for each of the named elements (vertices or edges) in the motif. In this case, the returned columns will be in order of the pattern: "a, e, b, e2."
gf_find(x, pattern)
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
pattern |
Pattern specifying the motif to search for. |
## Not run:
gf_friends(sc) %>%
  gf_find("(a)-[e]->(b); (b)-[e2]->(a)")
## End(Not run)
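The bidirectional motif "(a)-[e]->(b); (b)-[e2]->(a)" amounts to joining the edge table with itself. The following base-R sketch shows that idea on a made-up local edge list; it only illustrates the structure being matched and returns just the vertex pairs, not the full edge columns that `gf_find()` produces.

```r
# Toy directed edge list.
edges <- data.frame(
  src = c("a", "b", "b", "c"),
  dst = c("b", "a", "c", "d"),
  stringsAsFactors = FALSE
)

# Keep each edge (a -> b) for which the reverse edge (b -> a) also exists.
both <- merge(edges, edges, by.x = c("src", "dst"), by.y = c("dst", "src"))
names(both) <- c("a", "b")

both
```

Each mutual pair appears twice, once with each vertex in the role of `a`, which is also what a motif query over named elements would be expected to do.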
Graph of friends in a social network.
gf_friends(sc)
sc |
A Spark connection. |
## Not run:
library(sparklyr)
sc <- spark_connect(master = "local")
gf_friends(sc)
## End(Not run)
Create a new GraphFrame
gf_graphframe(vertices = NULL, edges)
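vertices |
A Spark DataFrame of vertices with an "id" column, such as the table returned by sdf_copy_to(). May be omitted. |
edges |
A Spark DataFrame of edges with "src" and "dst" columns, such as the table returned by sdf_copy_to(). |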
## Not run:
library(sparklyr)
sc <- spark_connect(master = "local", version = "2.3.0")
v_tbl <- sdf_copy_to(
  sc, data.frame(id = 1:3, name = LETTERS[1:3])
)
e_tbl <- sdf_copy_to(
  sc, data.frame(src = c(1, 2, 2), dst = c(2, 1, 3),
                 action = c("love", "hate", "follow"))
)
gf_graphframe(v_tbl, e_tbl)
gf_graphframe(edges = e_tbl)
## End(Not run)
Generate a grid Ising model with random parameters
gf_grid_ising_model(sc, n, v_std = 1, e_std = 1)
sc |
A Spark connection. |
n |
Length of one side of the grid. The grid will be of size n x n. |
v_std |
Standard deviation of normal distribution used to generate vertex factors "a". Default of 1.0. |
e_std |
Standard deviation of normal distribution used to generate edge factors "b". Default of 1.0. |
This method generates a grid Ising model with random parameters. Ising models are probabilistic graphical models over binary variables $x_i$. Each binary variable $x_i$ corresponds to one vertex, and it may take values -1 or +1. The probability distribution $P(X)$ (over all $x_i$) is parameterized by vertex factors $a_i$ and edge factors $b_{ij}$: $P(X) = \frac{1}{Z} \exp\left[ \sum_i a_i x_i + \sum_{ij} b_{ij} x_i x_j \right]$, where $Z$ is the normalizing constant.
GraphFrame. Vertices have columns "id" and "a". Edges have columns "src", "dst", and "b". Edges are directed, but they should be treated as undirected in any algorithms run on this model. Vertex IDs are of the form "i,j". E.g., vertex "1,3" is in the second row and fourth column of the grid.
## Not run:
gf_grid_ising_model(sc, 5)
## End(Not run)
In-degrees of vertices
gf_in_degrees(x)
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
Run static Label Propagation for detecting communities in networks. Each node in the network is initially assigned to its own community. At every iteration, nodes send their community affiliation to all neighbors and update their state to the mode community affiliation of incoming messages. LPA is a standard community detection algorithm for graphs. It is very inexpensive computationally, although (1) convergence is not guaranteed and (2) one can end up with trivial solutions (all nodes are identified into a single community).
gf_lpa(x, max_iter, ...)
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
max_iter |
Maximum number of iterations. |
... |
Optional arguments, currently not used. |
## Not run:
g <- gf_friends(sc)
gf_lpa(g, max_iter = 5)
## End(Not run)
Out-degrees of vertices
gf_out_degrees(x)
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
PageRank
gf_pagerank(x, tol = NULL, reset_probability = 0.15, max_iter = NULL, source_id = NULL, ...)
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
tol |
Tolerance. |
reset_probability |
Reset probability. |
max_iter |
Maximum number of iterations. |
source_id |
(Optional) Source vertex for a personalized PageRank. |
... |
Optional arguments, currently not used. |
## Not run:
g <- gf_friends(sc)
gf_pagerank(g, reset_probability = 0.15, tol = 0.01)
## End(Not run)
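To show how `reset_probability`, `tol`, and `max_iter` interact, here is a textbook power-iteration PageRank in base R on a made-up three-vertex graph. It is a sketch under assumptions, not the Spark implementation: the `pagerank` helper is hypothetical, ranks are normalized to sum to 1, `tol` is applied to the total absolute change, and every vertex is assumed to have at least one outgoing edge. Spark's results may be scaled differently.

```r
# Toy directed graph on vertices 1..3 (every vertex has an out-edge).
edges <- data.frame(src = c(1, 2, 3, 3), dst = c(2, 3, 1, 2))
n <- 3

# Column-stochastic transition matrix: M[j, i] is the probability of
# following an edge from i to j.
M <- matrix(0, n, n)
for (k in seq_len(nrow(edges))) M[edges$dst[k], edges$src[k]] <- 1
M <- sweep(M, 2, colSums(M), "/")

# Hypothetical helper (not part of 'graphframes').
pagerank <- function(M, reset_probability = 0.15, tol = 1e-8, max_iter = 100) {
  n <- nrow(M)
  r <- rep(1 / n, n)
  for (iter in seq_len(max_iter)) {
    # With probability reset_probability jump to a uniformly random vertex,
    # otherwise follow an outgoing edge.
    r_new <- reset_probability / n + (1 - reset_probability) * as.vector(M %*% r)
    if (sum(abs(r_new - r)) < tol) break
    r <- r_new
  }
  r_new
}

pagerank(M)
```

In this sketch, supplying `tol` stops the iteration on convergence, while `max_iter` caps the number of iterations regardless of convergence.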
Persist the GraphFrame
gf_persist(x, storage_level = "MEMORY_AND_DISK")
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
storage_level |
The storage level to be used. Please view the Spark Documentation for information on what storage levels are accepted. |
Register a GraphFrame object
gf_register(x)
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
Compute the strongly connected component (SCC) of each vertex and return a DataFrame with each vertex assigned to the SCC containing that vertex.
gf_scc(x, max_iter, ...)
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
max_iter |
Maximum number of iterations. |
... |
Optional arguments, currently not used. |
## Not run:
g <- gf_friends(sc)
gf_scc(g, max_iter = 10)
## End(Not run)
Computes shortest paths from every vertex to the given set of landmark vertices. Note that this takes edge direction into account.
gf_shortest_paths(x, landmarks, ...)
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
landmarks |
IDs of landmark vertices. |
... |
Optional arguments, currently not used. |
## Not run:
g <- gf_friends(sc)
gf_shortest_paths(g, landmarks = c("a", "d"))
## End(Not run)
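The effect of edge direction on landmark distances can be seen in a small base-R sketch. `hops_to` and the toy graph are hypothetical and run locally without Spark; the helper counts edges on the shortest directed path from each vertex to one landmark by searching backwards from that landmark.

```r
# Toy directed path a -> b -> c -> d.
edges <- data.frame(
  src = c("a", "b", "c"),
  dst = c("b", "c", "d"),
  stringsAsFactors = FALSE
)
vertices <- c("a", "b", "c", "d")

# Hypothetical helper (not part of 'graphframes'): hop count from every vertex
# to `landmark` along directed edges. NA means the landmark is unreachable.
hops_to <- function(vertices, edges, landmark) {
  dist <- setNames(rep(NA_integer_, length(vertices)), vertices)
  dist[landmark] <- 0L
  frontier <- landmark
  d <- 0L
  while (length(frontier) > 0) {
    d <- d + 1L
    # Vertices with an edge INTO the current frontier are one hop further away.
    prev <- unique(edges$src[edges$dst %in% frontier])
    frontier <- prev[is.na(dist[prev])]
    dist[frontier] <- d
  }
  dist
}

hops_to(vertices, edges, "c")
```

Vertex "d" sits right next to "c" but has no directed path to it, so its distance stays `NA`; this is the sense in which direction is taken into account.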
Returns a star graph with Long ID type, consisting of a central element indexed 0 (the root) and the n other leaf vertices 1, 2, ..., n.
gf_star(sc, n)
sc |
A Spark connection. |
n |
The number of leaves. |
## Not run:
gf_star(sc, 5)
## End(Not run)
This algorithm ignores edge direction; i.e., all edges are treated as undirected. In a multigraph, duplicate edges will be counted only once.
gf_triangle_count(x, ...)
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
... |
Optional arguments, currently not used. |
## Not run:
g <- gf_friends(sc)
gf_triangle_count(g)
## End(Not run)
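The two conventions noted for this function (direction ignored, duplicate edges counted once) can be illustrated with a local base-R sketch that counts triangles through each vertex using an adjacency matrix. The data are made up and this is not how Spark computes the result.

```r
# Toy edge list with a reciprocal duplicate (a -> b and b -> a).
edges <- data.frame(
  src = c("a", "b", "c", "c", "b"),
  dst = c("b", "c", "a", "d", "a"),
  stringsAsFactors = FALSE
)
vertices <- sort(unique(c(edges$src, edges$dst)))

# Symmetric 0/1 adjacency matrix: direction is dropped, and duplicate or
# reciprocal edges collapse into a single undirected edge.
A <- matrix(0, length(vertices), length(vertices),
            dimnames = list(vertices, vertices))
A[cbind(edges$src, edges$dst)] <- 1
A[cbind(edges$dst, edges$src)] <- 1
diag(A) <- 0

# diag(A %*% A %*% A) counts closed walks of length 3 from each vertex; every
# triangle through a vertex is walked twice, once in each direction.
triangles <- diag(A %*% A %*% A) / 2

triangles
```

Here "a", "b", and "c" form one triangle, and "d" is attached by a single edge, so it takes part in none.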
Triplets of graph
gf_triplets(x)
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
Two densely connected blobs (vertices 0->n-1 and n->2n-1) connected by a single edge (0->n).
gf_two_blobs(sc, blob_size)
sc |
A Spark connection. |
blob_size |
The size of each blob. |
## Not run:
gf_two_blobs(sc, 3)
## End(Not run)
Unpersist the GraphFrame
gf_unpersist(x, blocking = FALSE)
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
blocking |
Whether to block until all blocks are deleted. |
Vertices column names
gf_vertex_columns(x)
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
Extract vertices DataFrame
gf_vertices(x)
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
Retrieve a GraphFrame
spark_graphframe(x, ...)
x |
An object coercible to a GraphFrame (typically, a gf_graphframe). |
... |
Additional arguments, currently not used. |