This page reproduces the content of http://www.slideshare.net/slidarko/gremlin-a-graphbased-programming-language-3876581.

Uploaded on 2010/04/27 in Technology.

Gremlin is a Turing-complete, graph-based programming language developed for key/value-pair multi-relational graphs called property graphs. Gremlin makes extensive use of XPath 1.0 to support complex graph traversals. Connectors exist to various graph databases and frameworks. This language has application in the areas of graph query, analysis, and manipulation.

- Gremlin

G = (V, E)

A Graph-Based Programming Language

Marko A. Rodriguez

T-5, Center for Nonlinear Studies

Los Alamos National Laboratory

http://markorodriguez.com

http://gremlin.tinkerpop.com

February 25, 2010

- Abstract

Gremlin is a Turing-complete, graph-based programming language

developed for key/value-pair multi-relational graphs called property graphs.

Gremlin makes extensive use of XPath 1.0 to support complex graph

traversals. Connectors exist to various graph databases and frameworks.

This language has application in the areas of graph query, analysis, and

manipulation.

Center for Nonlinear Studies PostDoc Seminar – Los Alamos National Laboratory

- Acknowledgements

• Marko A. Rodriguez [http://markorodriguez.com]

designed, developed, tested, and documented Gremlin.

• Peter Neubauer [http://www.linkedin.com/in/neubauer]

aided in the design and the evangelizing of Gremlin.

• Pavel Yaskevich [http://github.com/xedin]

aided in the development of user defined functions in Gremlin.

• Joshua Shinavier [http://fortytwo.net]

provided initial conceptual support for Gremlin.

• Ketrina Yim [http://csillustrated.berkeley.edu]

designed the logo for Gremlin.

• Gremlin-Users Group [http://groups.google.com/group/gremlin-users]

provided much direction in the design and implementation of Gremlin.

- Outline

• Introduction to Graphs and Graph Software

• Basic Gremlin Concepts

• Gremlin Language Description

• Advanced Gremlin Concepts

• Conclusions

- What is a Graph?

• A graph (network) is composed of a collection of vertices (dots) and edges (lines).

There are many types of graphs: directed/undirected, weighted, attributed, etc.

[Figure: graph types and example annotations: vertex-labeled, vertex-attributed, edge-labeled, edge-attributed, directed, undirected, weighted, multi, hyper, pseudo, half-edge, regular, semantic, resource description framework. Example labels and attributes shown: a, 0.2, knows, hired, created=2-01-09, modified=2-11-09, http://ex.com/123, type="person", name="emil".]

- Why Use a Graph?

• A graph is a very general data structure that can be used to model

various systems.

A graph can model the structure of transportation, technological,

bibliographic, etc. systems.

A graph can model a list, a map, a tree, etc.

• There are numerous graph algorithms that are defined independent of

the domain of the graph model.

• There are numerous graph databases, frameworks, packages, etc.

that aid in the creation, manipulation, and analysis of graphs.

- Graph Databases, Frameworks, and Packages

• Neo4j Graph Database [http://neo4j.org]

• AllegroGraph Quad Store [http://www.franz.com/agraph]

• HyperGraphDB [http://www.kobrix.com/hgdb.jsp]

• Java Universal Network/Graph Framework [http://jung.sourceforge.net]

• OpenRDF Sesame Framework [http://www.openrdf.org]

• InfoGrid Graph Database [http://infogrid.org]

• Filament Graph Toolkit [http://filament.sourceforge.net]

• OWLim Semantic Repository [http://www.ontotext.com/owlim]

• Sones Graph Database [http://www.sones.com]

• NetworkX Graph Toolkit [http://networkx.lanl.gov]

• iGraph Toolkit [http://igraph.sourceforge.net]

• Blueprints Graph API [http://blueprints.tinkerpop.com]

• ... and many more.

- What Makes Gremlin Different?

• Gremlin is a domain specific language for working with graphs.

• Gremlin is not an application programming interface (API).

• Gremlin makes use of various graph databases, frameworks, packages.

• Gremlin is a language that currently has a virtual machine implementation written in Java.

• What can be succinctly expressed in Gremlin is verbose/clumsy to

express in general purpose languages such as Java, Python, Ruby, etc.

• Gremlin allows one to map single-relational graph analysis algorithms

over to the multi-relational domain.

- Single-Relational Graphs

• In single-relational graphs, all edges have the same meaning

(e.g. all edges are either friendship, kinship, worksWith, knows, etc.).

G = (V, E ⊆ (V × V ))

• Most graph algorithms are defined for single-relational graphs (e.g. centrality/ranking, clustering/community detection, etc.).

[Figure: three vertices labeled person-a, person-b, and person-c.]

NOTE: These types of graphs are also known as directed, vertex-labeled graphs.

- Multi-Relational Graphs

• In multi-relational graphs, edges can have different meanings.

G = (V, E ⊂ (V × V ), ω : E → Σ∗)

• Most graph software is designed for multi-relational graphs (e.g. arbitrary

objects as vertices and edges, knowledge-based reasoning systems, etc.).

[Figure: vertices person-a, book-b, and book-c connected by edges labeled read, cites, and authored.]

NOTE: These types of graphs are also known as directed, vertex/edge-labeled graphs.

- Gremlin and Multi-Relational Graphs

• Gremlin provides a means to elegantly map single-relational graph

analysis algorithms over to the multi-relational graph domain.

• Gremlin provides an elegant way to do automated reasoning in

multi-relational graphs using path expressions.

These two points form the primary thesis of this presentation.

Rodriguez, M.A., Shinavier, J., “Exposing Multi-Relational Networks to Single-Relational Network Analysis Algorithms,” Journal of Informetrics, 4(1), 29–41, doi:10.1016/j.joi.2009.06.004, LA-UR-08-03931, http://arxiv.org/abs/0806.2274, December 2009.

- Property Graphs

• Gremlin works with a type of multi-relational graph called a property

graph.

Vertices and edges are labeled with unique identifiers.

Edges are directed, labeled, and can form loops.

Multiple edges of the same label can exist for the same vertex pair.

Vertices and edges can have any number of key/value pair

properties/attributes.

Property graphs are a relatively general graph structure that can be constrained to model other graph

structures — though, a property-based hypergraph would be the most general (see HyperGraphDB and the

JUNG API).
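The property graph constraints listed above can be sketched as a small in-memory data structure. This is only an illustration: the class and method names below (PropertyGraphSketch, addVertex, addEdge) are hypothetical and are not the Blueprints API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, minimal in-memory property graph; not the Blueprints API.
final class PropertyGraphSketch {

    // An element (vertex or edge) has a unique id and key/value properties.
    static class Element {
        final String id;
        final Map<String, Object> properties = new LinkedHashMap<>();
        Element(String id) { this.id = id; }
    }

    static final class Vertex extends Element {
        final List<Edge> outEdges = new ArrayList<>();
        final List<Edge> inEdges = new ArrayList<>();
        Vertex(String id) { super(id); }
    }

    // Edges are directed (out vertex -> in vertex) and labeled.
    static final class Edge extends Element {
        final Vertex outVertex;
        final String label;
        final Vertex inVertex;
        Edge(String id, Vertex outVertex, String label, Vertex inVertex) {
            super(id);
            this.outVertex = outVertex;
            this.label = label;
            this.inVertex = inVertex;
        }
    }

    final Map<String, Vertex> vertices = new LinkedHashMap<>();
    final Map<String, Edge> edges = new LinkedHashMap<>();

    Vertex addVertex(String id) {
        Vertex v = new Vertex(id);
        vertices.put(id, v);
        return v;
    }

    // Loops (out == in) and parallel edges with the same label are allowed.
    Edge addEdge(String id, Vertex out, String label, Vertex in) {
        Edge e = new Edge(id, out, label, in);
        edges.put(id, e);
        out.outEdges.add(e);
        in.inEdges.add(e);
        return e;
    }

    public static void main(String[] args) {
        PropertyGraphSketch g = new PropertyGraphSketch();
        Vertex marko = g.addVertex("1");
        marko.properties.put("name", "marko");
        marko.properties.put("age", 29);
        Vertex lop = g.addVertex("3");
        lop.properties.put("name", "lop");
        lop.properties.put("lang", "java");
        Edge created = g.addEdge("9", marko, "created", lop);
        created.properties.put("weight", 0.4);
        System.out.println(marko.outEdges.size() + " out edge(s) from vertex " + marko.id);
    }
}
```

Keeping both an out-edge and an in-edge list on each vertex is one possible design choice; it makes traversal in either direction a simple list scan.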

- Property Graphs

[Figure: the example property graph used throughout this deck.]

Vertices:
1: name = "marko", age = 29
2: name = "vadas", age = 27
3: name = "lop", lang = "java"
4: name = "josh", age = 32
5: name = "ripple", lang = "java"
6: name = "peter", age = 35

Edges:
7: 1 -knows-> 2, weight = 0.5
8: 1 -knows-> 4, weight = 1.0
9: 1 -created-> 3, weight = 0.4
10: 4 -created-> 5, weight = 1.0
11: 4 -created-> 3, weight = 0.4
12: 6 -created-> 3, weight = 0.2

- Outline

• Introduction to Graphs and Graph Software

• Basic Gremlin Concepts

• Gremlin Language Description

• Advanced Gremlin Concepts

- Gremlin System Architecture

• The Gremlin console is a scripting environment which allows for the dynamic evaluation of Gremlin code.

• Gremlin implements JSR 223 which allows Gremlin to also be used within the Java language and thus, as a virtual machine directly accessible to Java applications. Popular JSR 223 implementations include Jython, JRuby, and Groovy. For a fine list of implementations see https://scripting.dev.java.net.

• Blueprints is a set of interfaces for abstract data structures such as graphs and documents. Implementations to these interfaces exist for various data management systems.

• There exist many graph data management systems that span various graph data models (e.g. edge labeled graphs, RDF graphs, hypergraphs, etc.).

[Figure: architecture diagram showing the Gremlin Console, the Gremlin ScriptEngine, and the Neo4j, NativeStore, and TinkerGraph back ends.]
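Because JSR 223 is the standard javax.script API, embedding a script engine in Java can be sketched as below. The engine name "gremlin" is an assumption (the slides do not state the registered name), and the sketch simply reports when no such engine is found on the classpath.

```java
import javax.script.ScriptEngine;
import javax.script.ScriptEngineManager;
import javax.script.ScriptException;

final class Jsr223Sketch {

    // Evaluates a script with the named JSR 223 engine, or returns null
    // when no engine with that name is registered on the classpath.
    static Object evalWith(String engineName, String script) {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName(engineName);
        if (engine == null) {
            return null;
        }
        try {
            return engine.eval(script);
        } catch (ScriptException e) {
            throw new IllegalStateException("script evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // "gremlin" as the engine name is an assumption, not confirmed by the slides.
        Object result = evalWith("gremlin", "concat('goodbye', ' ', 'self')");
        System.out.println(result == null ? "no Gremlin engine on the classpath" : result);
    }
}
```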

- “Hello World” in the Gremlin Console

marko$ ./gremlin.sh

\,,,/

(o o)

-----oOOo-(_)-oOOo-----

gremlin>

gremlin> concat('goodbye', ' ', 'self')

==>goodbye self

- Simple Traversals in Gremlin

gremlin> $_ := g:key('name','marko')
==>v[1]
gremlin> .
==>v[1]
gremlin> ./outE
==>e[7][1-knows->2]
==>e[9][1-created->3]
==>e[8][1-knows->4]
gremlin> ./outE/@weight
==>0.5
==>0.4
==>1.0

./outE/@weight: “Get the current object(s). Then get the outgoing edges of those objects. Then get the weights of those edges.”

$_ is a reserved variable meaning the root list of objects.

- Simple Traversals in Gremlin

gremlin> .
==>v[1]
gremlin> ./outE[@label='created']/inV
==>v[3]
gremlin> $_ := $_last
==>v[3]
gremlin> ./@name
==>lop
gremlin> g:map(.)
==>name=lop
==>lang=java

./outE[@label='created']/inV: “Get the current object(s). Then get the outgoing edges of those objects, where their labels equal 'created'. Then get the incoming vertices of those 'created' edges.”

$_last is a reserved variable meaning the last value evaluated.

- Simple Traversals in Gremlin

./outE[@label='knows']/inV[matches(@name,'va.{3}') and @age > 21]/@name

==>vadas

- Simple Traversals in Gremlin

./outE[@label='knows']/inV[matches(@name,'va.{3}') and @age > 21]/@name

1. .: Get the current object(s).

2. outE[@label='knows']: Get the outgoing edges of the current object(s), where their labels equal 'knows'.

3. inV[matches(@name,'va.{3}') and @age > 21]: Get the incoming vertices of those 'knows' edges, where the names of those vertices are 5 characters long, start with 'va', and whose age is greater than 21.

4. @name: Get the name of those particular incoming vertices.
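For comparison, the four steps above can be mirrored in plain Java. This is a sketch over a hypothetical, stripped-down graph model (the Vertex, Edge, and connect names are illustrative only, not Blueprints); each stream stage is annotated with the path step it stands for.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class TraversalSketch {

    // Hypothetical minimal vertex: properties plus outgoing edges.
    static final class Vertex {
        final Map<String, Object> props = new HashMap<>();
        final List<Edge> outE = new ArrayList<>();
        Vertex(String name, int age) {
            props.put("name", name);
            props.put("age", age);
        }
    }

    // Hypothetical minimal edge: a label and the head (incoming) vertex.
    static final class Edge {
        final String label;
        final Vertex inV;
        Edge(String label, Vertex inV) {
            this.label = label;
            this.inV = inV;
        }
    }

    static void connect(Vertex out, String label, Vertex in) {
        out.outE.add(new Edge(label, in));
    }

    // Mirrors ./outE[@label='knows']/inV[matches(@name,'va.{3}') and @age > 21]/@name
    static List<String> knownNames(Vertex start) {
        return start.outE.stream()
                .filter(e -> e.label.equals("knows"))                       // outE[@label='knows']
                .map(e -> e.inV)                                            // inV
                .filter(v -> ((String) v.props.get("name")).matches("va.{3}")
                        && ((Integer) v.props.get("age")) > 21)             // [matches(...) and @age > 21]
                .map(v -> (String) v.props.get("name"))                     // @name
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Vertex marko = new Vertex("marko", 29);
        Vertex vadas = new Vertex("vadas", 27);
        Vertex josh = new Vertex("josh", 32);
        connect(marko, "knows", vadas);
        connect(marko, "knows", josh);
        System.out.println(knownNames(marko));
    }
}
```

Note that String.matches requires the whole name to match the pattern, which agrees with the "5 characters long" reading given in step 3.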

- Knowledge-Based Reasoning

• Blueprints implements the Sesame SAIL interfaces and thus, Gremlin

can be used over the many Resource Description Framework (RDF)

triple/quad stores. In such cases, RDF is modeled as a property graph

where the named graph component is the @ng edge property.

• Gremlin makes use of the Sesame SAIL SPARQL engine to allow for

queries based on graph-pattern matching.

gremlin> sail:sparql('SELECT ?x ?y WHERE { ?x foaf:knows ?y }')

==>{y=v[http://ex.com#2], x=v[http://ex.com#1]}

==>{y=v[http://ex.com#4], x=v[http://ex.com#1]}

• Gremlin is useful for knowledge-based reasoning using path expressions.

- Reasoning as Defining New Types of Adjacency

• Graph-based reasoning is the process of making explicit what is implicit in the graph.

• A reasoner takes a graph G and a collection of graph-patterns (i.e. transformation/rewrite rules) and creates a new graph G′ (usually, G ⊂ G′). G′ has new relationships/edges and thus, new definitions of vertex adjacency.

• Example: The co-developers of person A are those people who have created the same software as person A and who are themselves, not person A (as person A has created the same software as him or herself).

For these “co-developer” examples, we will use vertex 1 (marko) as the source of the reasoning process.

[Figure: the example graph (marko, vadas, josh, peter, lop, ripple) with its knows and created edges, overlaid with inferred co-developer edges.]

- The Co-Developers of Marko A. Rodriguez in SPARQL

SELECT ?x WHERE {
  marko created ?y .
  ?z created ?y .
  ?z != marko .
  ?z name ?x
}

This query would return: josh and peter.

[Figure: the example graph annotated with the query variables ?x, ?y, and ?z.]

- The Co-Developers of Marko A. Rodriguez in Gremlin

gremlin> ./@name
==>marko
gremlin> ./outE[@label='created']/inV/inE[@label='created']/outV[g:except($_)]/@name
==>josh
==>peter

- The Co-Developers of Marko A. Rodriguez in Gremlin

./outE[@label='created']/inV/inE[@label='created']/outV[g:except($_)]/@name

1. .: Get the current object(s) (i.e. vertex 1, denoting Marko).

2. outE[@label='created']: Get the outgoing edges of the Marko vertex, where their labels equal 'created'.

3. inV: Get the incoming (i.e. head) vertices of those 'created' edges.

4. inE[@label='created']: Get the incoming edges of those vertices, where their labels equal 'created'.

5. outV[g:except($_)]: Get the outgoing (i.e. tail) vertices of those 'created' edges, where those vertices are not the Marko vertex.

6. @name: Get the name of those non-Marko vertices.

- Defining Co-Developers in Gremlin

path co-developer

./outE[@label='created']/inV/inE[@label='created']/outV[g:except($_)]

end

Once defined, you can use it like any other path segment.

gremlin> ./co-developer

==>v[4]

==>v[6]

gremlin> ./co-developer/@name

==>josh

==>peter

- Defining Co-Developers in Java

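import java.util.ArrayList;
import java.util.List;

// Vertex and Edge come from Blueprints and Path from Gremlin; their imports
// are omitted here because the package names depend on the library version.
public class CoDeveloperPath implements Path {

    public List invoke(Object root) {
        if (!(root instanceof Vertex)) {
            return null; // only vertices have co-developers
        }
        // Collect the projects that the root vertex created.
        List<Vertex> projects = new ArrayList<Vertex>();
        for (Edge edge : ((Vertex) root).getOutEdges()) {
            if (edge.getLabel().equals("created")) {
                projects.add(edge.getInVertex());
            }
        }
        // Collect everyone else who created one of those projects.
        // equals() is used instead of != in case the same vertex is
        // handed back as a distinct Java object.
        List<Vertex> coDevelopers = new ArrayList<Vertex>();
        for (Vertex project : projects) {
            for (Edge edge : project.getInEdges()) {
                if (edge.getLabel().equals("created") && !edge.getOutVertex().equals(root)) {
                    coDevelopers.add(edge.getOutVertex());
                }
            }
        }
        return coDevelopers;
    }
}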

- Outline

• Introduction to Graphs and Graph Software

• Basic Gremlin Concepts

• Gremlin Language Description

• Advanced Gremlin Concepts

• Conclusions

- Gremlin Type System

[Figure: type hierarchy with object at the root; the types shown are element, graph, number, string, boolean, map, and list, with vertex and edge as the two kinds of element.]

- Predefined Paths and Properties

[Figure: fragment of the example graph (vertices 1, 3, and 4; edges 8 knows, 9 created, and 11 created) annotated with: vertex 1 out edges, vertex 3 in edges, edge 9 label, edge 9 out vertex, edge 9 in vertex, edge 9 id, vertex 4 id, and vertex 4 properties (name = "josh", age = 32).]

object       property  description                            example
graph        V         the vertex iterator of the graph       $g/V
graph        E         the edge iterator of the graph         $g/E
vertex/edge  @id       the identifier of the element          $v/@id
vertex       outE      the outgoing edges of the vertex       $v/outE
vertex       inE       the incoming edges of the vertex       $v/inE
vertex       bothE     both in and out edges of the vertex    $v/bothE
edge         outV      the outgoing tail vertex of the edge   $e/outV
edge         inV       the incoming head vertex of the edge   $e/inV
edge         bothV     both in and out vertices of the edge   $e/bothV
edge         @label    the label of the edge                  $e/@label

- Predefined Functions

g:assign(), g:unassign(), g:id(), g:key(), g:add-v(), g:add-e(), g:remove-ve(), g:idx-all(), g:add-idx(), g:remove-idx(), g:load(), g:save(), g:clear(), g:close(), g:list(), g:dedup(), g:union(), g:intersect(), g:difference(), g:retain(), g:except(), g:remove(), g:get(), g:op-value(), g:sort(), g:map(), g:keys(), g:values(), g:rand-nat(), g:rand-real(), g:prob(), g:cont(), g:halt(), g:type(), g:print(), g:time(), g:p(), g:to-json(), g:from-json(), ...

There are over 70 predefined functions. See the following for a description of each.

http://wiki.github.com/tinkerpop/gremlin/core-function-library

http://wiki.github.com/tinkerpop/gremlin/gremlin-function-library

- Working With Non-Graph Types

gremlin> 1.2 + 6

==>7.2

gremlin> 'this is a string'

==>this is a string

gremlin> true() or false()

==>true

gremlin> g:map('marko','lanl','peter','neotech','josh','rpi')

==>marko=lanl

==>peter=neotech

==>josh=rpi

gremlin> g:list('graphs','hockey','motorcycles',6)

==>graphs

==>hockey

==>motorcycles

==>6.0

- Working With Non-Graph Types

gremlin> $m := g:map('hobbies', g:list('hockey','graphs'),
           'location', g:map('state','new mexico', 'city', 'santa fe',
           'zipcode', 87501), 'age', 30)

==>location={zipcode=87501.0, state=new mexico, city=santa fe}

==>age=30.0

==>hobbies=[hockey, graphs]

gremlin> $m/@age

==>30.0

gremlin> $m/@hobbies[2]

==>graphs

gremlin> $m/@location/@city

==>santa fe

- Variables

• Variables in Gremlin are prefixed with a $ character.

• There is a collection of reserved variables that all begin with $_.

$_ is the root list of objects.

$_last is the last result evaluated by the evaluator.

$_g is the “working graph” to reduce typing with graph functions.

gremlin> $x := 1

==>1.0

gremlin> $y := 2

==>2.0

gremlin> $x + $y

==>3.0

- Language Statements

Variable Assignment

gremlin> $i := 1 + 5
==>6.0
gremlin> $i
==>6.0

Repeat

gremlin> $i := 0
==>0.0
gremlin> repeat 10
           $i := $i + 1
         end
==>10.0

If/Else

gremlin> if true()
           $i := 1
         else
           $i := 2
         end
==>1.0

While

gremlin> $i := 'g'
==>g
gremlin> while not(matches($i, 'ggg'))
           $i := concat($i,'g')
         end
==>ggg

- Language Statements

Foreach

gremlin> $i := 0
==>0.0
gremlin> foreach $j in 1 | 2 | 3
           $i := $i + $j
         end
==>6.0

Path

gremlin> path friend_name
           ./outE[@label='knows']/inV/@name
         end
gremlin> ./friend_name
==>vadas
==>josh

Function

gremlin> func ex:hello($name)
           concat('hello ', $name)
         end
gremlin> ex:hello('pavel')
==>hello pavel

You can define functions and paths in native Gremlin (as demonstrated above) or in Java.

- XPath Filters

• Use [ ] filters to filter objects in a path expression (i.e. “such that” or

“where”)

• The evaluated result of [ ] must be a number or boolean.

If it is a number, it is treated as the position within an array (i.e. list).

If it is boolean, it is treated as whether to include or exclude the

object from the next path in the sequence.

gremlin> ./outE[@label='knows']

==>e[7][1-knows->2]

==>e[8][1-knows->4]

gremlin> ./outE[@label='knows' and @weight>0.5]/inV[@age<21 or @name='josh'][true()][1]

==>v[4]
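The two filter behaviors described above (a boolean keeps or drops each object, a number selects a position) can be sketched in Java as follows. The helper names where and at are hypothetical, and positions are taken to be 1-based, as the $m/@hobbies[2] example earlier in the deck suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Predicate;

final class FilterSketch {

    // Boolean filter: keep only the objects for which the predicate holds.
    static <T> List<T> where(List<T> objects, Predicate<T> keep) {
        List<T> result = new ArrayList<>();
        for (T object : objects) {
            if (keep.test(object)) {
                result.add(object);
            }
        }
        return result;
    }

    // Numeric filter: keep the object at the given 1-based position, if any.
    static <T> List<T> at(List<T> objects, int position) {
        if (position < 1 || position > objects.size()) {
            return Collections.emptyList();
        }
        return Collections.singletonList(objects.get(position - 1));
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("vadas", "josh", "peter");
        // Chained filters, in the spirit of [...][true()][1].
        List<String> longNames = where(names, n -> n.length() > 4);
        List<String> unchanged = where(longNames, n -> true);
        System.out.println(at(unchanged, 1));
    }
}
```

Chaining works because each filter takes a list and returns a list, so the output of one bracket is the input of the next.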

- Outline

• Introduction to Graphs and Graph Software

• Basic Gremlin Concepts

• Gremlin Language Description

• Advanced Gremlin Concepts

• Conclusion

- A Grateful Dead Dataset

2,500 concerts

35,000 songs played

600 songs

30 years

11 members

1 band

... the Grateful Dead.

- A Grateful Dead Dataset

• vertices denote songs and artists

type: “song” or “artist”.

name: name of song or artist.

performances: number of times song was played in concert.

song_type: whether the song was a “cover” or “original”.

• edges denote followed_by, sung_by, written_by

weight: number of times a song was followed by another song over all concerts played.

Rodriguez, M.A., Gintautas, V., Pepe, A., “A Grateful Dead Analysis: The Relationship Between Concert and Listening

Behavior,” First Monday, 14(1), University of Illinois at Chicago Library, http://arxiv.org/abs/0807.2466, January 2009.

NOTE: A portion of the raw dataset courtesy of Mark Leone http://www.cs.cmu.edu/~mleone/gdead/setlists.html

- A Grateful Dead Dataset

[Figure: excerpt of the 2nd set at the Stanley Theater, Pittsburgh, PA (11/30/79): Scarlet Begonias, Fire on the Mountain, Passenger, Terrapin Station, ... The set is modeled as song vertices (type="song") connected by weighted followed_by edges (weights shown: 239, 1, 2) and linked to artist vertices (type="artist"; Garcia, Hunter, Lesh) by sung_by and written_by edges.]

- A Grateful Dead Dataset – Load Data/Basic Stats

gremlin> g:load('data/graph-example-2.xml')

==>true

gremlin> count($_g/V)

==>809.0

gremlin> count($_g/E)

==>8049.0

- A Grateful Dead Dataset – Out-Degree of Each Vertex

gremlin> g:sort($degrees, 'value', true())

==>PLAYING IN THE BAND=96.0

==>SUGAR MAGNOLIA=92.0

==>PROMISED LAND=89.0

==>GOOD LOVING=87.0

==>NOT FADE AWAY=86.0

==>I KNOW YOU RIDER=85.0

==>CASSIDY=83.0

==>DEAL=82.0

==>JACK STRAW=81.0

==>ONE MORE SATURDAY NIGHT=81.0

==>EL PASO=80.0

==>MEXICALI BLUES=79.0

...

- A Grateful Dead Dataset – Inspecting Single Vertex

gremlin> $v := g:key('name','CHINA DOLL')[1]

==>v[129]

gremlin> g:map($v)

==>name=CHINA DOLL

==>song_type=original

==>performances=114

==>type=song

gremlin> $v/outE[@label='sung_by']/inV/@name

==>Garcia

- A Grateful Dead Dataset – Inspecting Single Vertex

gremlin> $v/outE[@label='followed_by']/inV/@name

==>BIG RIVER

==>THROWING STONES

==>SAMSON AND DELILAH

==>TRUCKING

==>CASEY JONES

==>HIGH TIME

...

gremlin> $v/outE[@label='followed_by']/@weight

==>2

==>8

==>1

==>2

==>1

==>1

...

- Introduction to PageRank

• The remainder of this section will discuss the PageRank algorithm and

its application to multi-relational graphs.

• The arguments made and the examples presented generalize to all other

single-relational graph algorithms. However, for the sake of brevity and

consistency, only PageRank will be discussed.

- Introduction to Matrix-Based PageRank

• PageRank is a centrality measure based on the primary eigenvector of a modified version of a graph. Let \(A \in \mathbb{R}_{+}^{|V| \times |V|}\) denote the adjacency matrix representing the graph.

• In order to ensure positive real values in the eigenvector, the graph must be strongly connected. PageRank induces strong connectivity by overlaying a low probability (defined by \(\alpha \in [0, 1]\), usually 0.15) “teleportation” graph over the original graph. Let \(B \in \left\{\tfrac{1}{|V|}\right\}^{|V| \times |V|}\) (every entry equals \(1/|V|\)) denote a teleportation adjacency matrix where every vertex is connected to every vertex with equal probability.

\(C = (1 - \alpha)A + \alpha B\), where \(C \in \mathbb{R}_{+}^{|V| \times |V|}\)

\(\lambda = \lambda C\), where \(\lambda \in \mathbb{R}_{+}^{|V|}\) is the PageRank vector over \(V\).
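One way to make the matrix form concrete is a small power iteration that builds C = (1 − α)A + αB and repeatedly applies λ ← λC. This sketch assumes A is row-stochastic (each row sums to 1), which the slides do not state explicitly; the class name, the toy matrix, and the iteration count are illustrative choices.

```java
import java.util.Arrays;

final class MatrixPageRankSketch {

    // Builds C = (1 - alpha) * A + alpha * B, where every entry of B is 1/|V|.
    static double[][] teleportMix(double[][] a, double alpha) {
        int n = a.length;
        double[][] c = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                c[i][j] = (1 - alpha) * a[i][j] + alpha / n;
            }
        }
        return c;
    }

    // Power iteration: repeatedly applies lambda <- lambda * C
    // (row vector times matrix), starting from the uniform vector.
    static double[] pageRank(double[][] a, double alpha, int iterations) {
        double[][] c = teleportMix(a, alpha);
        int n = a.length;
        double[] lambda = new double[n];
        Arrays.fill(lambda, 1.0 / n);
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) {
                    next[j] += lambda[i] * c[i][j];
                }
            }
            lambda = next;
        }
        return lambda;
    }

    public static void main(String[] args) {
        // Assumed toy input (row-stochastic): 0 -> 1, 1 -> 0, 2 -> 0.
        double[][] a = { {0, 1, 0}, {1, 0, 0}, {1, 0, 0} };
        System.out.println(Arrays.toString(pageRank(a, 0.15, 200)));
    }
}
```

In the toy input, vertex 2 has no incoming edges, so its score comes from teleportation alone and settles at α/|V|.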

- Introduction to Random Walk-Based PageRank

• PageRank can be implemented by a random walk.

• Create a vertex counter map, \(m : V \to \mathbb{N}_{+}\).

• Place a walker on a random vertex in V. Denote the walker’s current vertex i ∈ V.

1. increment the vertex counter by 1 (i.e. m(i) ← m(i) + 1).

2. with probability 1 − α, the walker moves to a random adjacent vertex.

3. with probability α, the walker instead teleports to a random vertex in V.

4. rinse and repeat until m reaches a stationary probability distribution (continually normalize m if you want a probability distribution).

• We will use this random walk model in the Gremlin examples to follow.
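The random walk model can be sketched in Java as below. It follows the convention of the matrix form C = (1 − α)A + αB, where α is the teleportation probability, and additionally teleports at dead ends. The adjacency is passed in as a function so that any definition of "adjacent" can be plugged in; the class name, the toy graph, and the fixed random seed are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.function.Function;

final class RandomWalkPageRankSketch {

    // Counts walker visits. 'adjacent' defines which vertices the walker may
    // step to; with probability alpha (or at a dead end) it teleports instead.
    static <V> Map<V, Integer> walk(List<V> vertices, Function<V, List<V>> adjacent,
                                    double alpha, int steps, Random random) {
        Map<V, Integer> counts = new HashMap<>();
        V current = vertices.get(random.nextInt(vertices.size()));
        for (int step = 0; step < steps; step++) {
            counts.merge(current, 1, Integer::sum); // m(i) <- m(i) + 1
            List<V> next = adjacent.apply(current);
            if (next.isEmpty() || random.nextDouble() < alpha) {
                // Teleport to a uniformly random vertex.
                current = vertices.get(random.nextInt(vertices.size()));
            } else {
                // Follow a uniformly random outgoing adjacency.
                current = next.get(random.nextInt(next.size()));
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Assumed toy graph: a -> b, b -> a, c -> a.
        Map<String, List<String>> out = new HashMap<>();
        out.put("a", Arrays.asList("b"));
        out.put("b", Arrays.asList("a"));
        out.put("c", Arrays.asList("a"));
        List<String> vertices = Arrays.asList("a", "b", "c");
        Map<String, Integer> m = walk(vertices,
                v -> out.getOrDefault(v, Collections.emptyList()),
                0.15, 100000, new Random(42));
        System.out.println(m);
    }
}
```

Dividing each count by the number of steps gives an estimate of the stationary probability distribution.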

- PageRank over Multi-Relational Graphs

• PageRank was designed for single-relational graphs (i.e. where all edges

have the same meaning).

• In a multi-relational graph, what does it mean to find the centrality

of a vertex when vertices can be related by various types of edges?

For example, if there exists “socializes with” and “met once”, then the

person who “met once” many people could be the most centrally located

in the graph. Also, what if your graph has more than just “person”-type

vertices (e.g. cars, pets, buildings, articles, etc.) and “person”-type

edges (e.g. owns, walks, livesAt, cites, etc.)?

- PageRank over Multi-Relational Graphs

• Calculating single-relational PageRank would yield Person as the most central vertex.

• You can boolean filter certain edge labels (e.g. ignore type edges; in such cases, you would have the centrality scores over the knows social graph).

• However, what if you only wanted to traverse knows edges if and only if the adjacent vertex knows more than 10 other people?

• In the end, you want complete control (universal computability) over the paths that the traverser/walker can take through a graph.

[Figure: people (Herbert, Johan, Marko, Josh, Jen, ...) linked to one another by knows edges, each linked to a single Person vertex by a type edge.]

- PageRank over Multi-Relational Graphs

• In multi-relational graphs, the meaning of your graph algorithm’s results is

defined by your definition of adjacency.

• With respect to random walk-based PageRank, define the path that the walker

should take. That path is the definition of adjacency.

• The stationary probability distribution created from this walk yields a path-dependent

centrality.

• Thus, in a multi-relational graph, there are many types of PageRanks that can

be calculated — one for each type of path defined for a walker.

Rodriguez, M.A., “Grammar-Based Random Walkers in Semantic Networks”, Knowledge-Based Systems,

21(7), 727–739, http://arxiv.org/abs/0803.4355, October 2008.

- PageRank over “Garcia Followed By” SubGraph

• Define a path that will go from song-to-song by “followed_by” edges and only traverse songs that are “sung_by” Jerry Garcia.

(./outE[@label='followed_by']/inV/outE[@label='sung_by']
  /inV[@name='Garcia']/../..)[g:rand-nat()]

[Figure: the path expression applied to one song, in steps labeled A to D: followed_by edges lead to candidate songs, sung_by edges lead to artists, only name="Garcia" (not name="Weir") passes the filter, /../.. steps back to the song, and g:rand-nat() selects one.]

- PageRank over “Garcia Followed By” SubGraph

path garcia-followed_by
  (./outE[@label='followed_by']/inV/outE[@label='sung_by']
    /inV[@name='Garcia']/../..)[g:rand-nat()]
end

$m := g:map()
$alpha := 0.15
$_ := g:key('type', 'song')[g:rand-nat()]
repeat 2500
  $_ := ./garcia-followed_by
  if count($_) > 0
    g:op-value('+',$m,$_[1]/@name, 1.0)
  end
  if g:rand-real() < $alpha or count($_) = 0
    $_ := g:key('type', 'song')[g:rand-nat()]
  end
end

- PageRank over “Garcia Followed By” SubGraph

gremlin> g:sort($m,'value',true())

==>CRAZY FINGERS=98.0

==>HES GONE=85.0

==>CHINA CAT SUNFLOWER=79.0

==>BERTHA=76.0

==>UNCLE JOHNS BAND=74.0

==>TERRAPIN STATION=72.0

==>GOING DOWN THE ROAD FEELING BAD=71.0

==>WHARF RAT=71.0

==>EYES OF THE WORLD=65.0

==>COLD RAIN AND SNOW=62.0

==>SHIP OF FOOLS=58.0

==>RAMBLE ON ROSE=53.0

==>CASEY JONES=51.0

==>DARK STAR=47.0

==>DEAL=46.0

...

- Universal Computation in Paths

path path-name

# any arbitrary computation can occur here

end

• A path definition can be used to define adjacencies.

adjacency can be expressed as anything that can be computed by a Turing machine.

path definitions are used to create “semantically meaningful” results from single-

relational graph algorithms applied to multi-relational graphs.

path definitions make explicit what is implicit in the structure of the graph. This

has applications to knowledge-based reasoning.

• A path definition can perform any arbitrary computation.

path definitions can check/set vertex/edge properties.

path definitions can create new vertices and edges.

path definitions can call/define functions.

This allows fine grained control over how your traverser/walker moves through a graph.

- Outline

• Introduction to Graphs and Graph Software

• Basic Gremlin Concepts

• Gremlin Language Description

• Advanced Gremlin Concepts

• Conclusions

- The Current Gremlin EcoSystems

• Webling: Web console for Gremlin

(developed by Pavel Yaskevich w/ funding from Neo Technology)

• Project Gargamel: Distributed Graph Computing

(uses Linked Process and Gremlin)

• ReXster: A Graph-Based Recommender Engine

- Thank You

Please enjoy Gremlin at http://gremlin.tinkerpop.com ...

My homepage is http://markorodriguez.com.

Please feel free to contact me with any questions or comments.
