Neo4j Graph Data Science 1.5: Exploring the HITS Algorithm
The Neo4j Graph Data Science Library provides efficiently implemented, parallel versions of common graph algorithms for Neo4j, exposed as Cypher procedures. It recently published version 1.5, which has lots of goodies to play with.
In this blog post, we’re going to explore the newly added HITS algorithm with the help of a citations dataset.
Launching Neo4j
We’re going to run Neo4j with the Graph Data Science Library using the following Docker Compose configuration:
version: '3.7'
services:
neo4j:
image: neo4j:4.2.3-enterprise
container_name: "neo4j4.2-gds1.5-exploration"
volumes:
- ./plugins-4.2:/plugins
- ./data-4.2:/data
ports:
- "7474:7474"
- "7687:7687"
environment:
- NEO4J_ACCEPT_LICENSE_AGREEMENT=yes
- NEO4J_AUTH=neo4j/neo
- NEO4JLABS_PLUGINS=["apoc", "graph-data-science"]
If you want to follow along with the examples used in the blog post, you can copy the configuration above to a file titled docker-compose.yml
We can now launch the Neo4j server by running the following command:
docker-compose up
After we’ve launched that command, we need to wait until we see the following output:
neo4j4.2-gds1.5-exploration | 2021-02-03 21:39:15.346+0000 INFO Bolt enabled on 0.0.0.0:7687.
neo4j4.2-gds1.5-exploration | 2021-02-03 21:39:16.053+0000 INFO Remote interface available at http://localhost:7474/
neo4j4.2-gds1.5-exploration | 2021-02-03 21:39:16.053+0000 INFO Started.
Importing the Citation Graph
We’re going to import the Citations Graph from the Using a Machine Learning Workflow for Link Prediction online training.
Let’s first connect to Neo4j using the Cypher Shell:
docker exec -it neo4j4.2-gds1.5-exploration bin/cypher-shell -u neo4j -p neo
Connected to Neo4j using Bolt protocol version 4.2 at neo4j://localhost:7687 as user neo4j.
Type :help for a list of available commands or :exit to exit the shell.
Note that Cypher queries must end with a semicolon.
neo4j@neo4j>
The data for the citation graph is available as a set of JSON lines files, which we can find at https://github.com/mneedham/link-prediction/tree/master/data. Below is an example of one line of one of the files:
{"authors": ["Tegegne Marew", "Doo-Hwan Bae"], "n_citation": 1, "references": ["2134bf3b-fd89-4724-90ce-5993b4fa3218", "906c17e0-db09-407b-b760-41df5a3f0293", "94f4382e-cfa6-4aec-92b8-3711fc55da54", "9f172585-8d42-4fce-b6ae-aede321f3fd4", "a3aee287-efd0-4b9d-9cda-d47dd192c9f4", "a9a7fd07-ef71-4b3c-8fcf-d7fe114d2148", "d63dd4ae-4b30-484b-8ffc-88d21839ddad"], "title": "Using Classpects for Integrating Non-Functional and Functional Requirements.", "venue": "international conference on software engineering", "year": 2006, "id": "01f1d231-80ae-4cce-b56c-9d821e0924d0"}
We’re going to use some APOC procedures to convert these JSON files into the following graph structure:
So we’ll have the following node labels:
-
Article
- a paper on a topic -
Venue
- where the paper was presented -
Author
- the person/people that worked on theArticle
And these relationship types:
-
CITED
- which articles were cited by an article -
AUTHOR
- which articles were written by a person/people -
VENUE
- where an article was presented
Let’s first create a new database and setup some constraints to make sure we don’t end up with duplicates:
CREATE OR REPLACE DATABASE citationsblogpost;
:use citationsblogpost;
CREATE CONSTRAINT ON (a:Article) ASSERT a.index IS UNIQUE;
CREATE CONSTRAINT ON (a:Author) ASSERT a.name IS UNIQUE;
CREATE CONSTRAINT ON (v:Venue) ASSERT v.name IS UNIQUE;
Now we’ll import the data:
CALL apoc.periodic.iterate(
'UNWIND ["dblp-ref-0.json", "dblp-ref-1.json", "dblp-ref-2.json", "dblp-ref-3.json"] AS file
CALL apoc.load.json("https://github.com/mneedham/link-prediction/raw/master/data/" + file)
YIELD value WITH value
return value',
'MERGE (a:Article {index:value.id})
SET a += apoc.map.clean(value,["id","authors","references", "venue"],[0])
WITH a, value.authors as authors, value.references AS citations, value.venue AS venue
MERGE (v:Venue {name: venue})
MERGE (a)-[:VENUE]->(v)
FOREACH(author in authors |
MERGE (b:Author{name:author})
MERGE (a)-[:AUTHOR]->(b))
FOREACH(citation in citations |
MERGE (cited:Article {index:citation})
MERGE (a)-[:CITED]->(cited))',
{batchSize: 1000, iterateList: true})
YIELD batches, total, timeTaken, committedOperations;
batches | total | timeTaken | committedOperations |
---|---|---|---|
52 |
51956 |
21 |
51956 |
And finally, a bit of cleanup to remove articles that don’t have a title:
MATCH (a:Article)
WHERE not(exists(a.title))
DETACH DELETE a;
HITS Algorithm
The HITs algorithm, like many other graph algorithms, was invented to do link analysis on web pages. It is a centrality algorithm, which means that it indicates node importance based on some metric. We can learn more about it from the HITS Wikipedia page:
The idea behind Hubs and Authorities stemmed from a particular insight into the creation of web pages when the Internet was originally forming; that is, certain web pages, known as hubs, served as large directories that were not actually authoritative in the information that they held, but were used as compilations of a broad catalog of information that led users direct to other authoritative pages.
The scheme therefore assigns two scores for each page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages.
So a page with a high authority score has high value content, whereas a page with a high hub score links out to important pages.
We’re going to use this algorithm to analyse the citations between articles in our graph, so what does those different scores mean for us?
-
An article with a high authority score will likely have a lot of citations, perhaps some of those by other important articles
-
An article with a high hub score can help direct us (via its citations) to the important articles. It’s not clear to me that the hub score makes so much sense in this graph because there aren’t really articles written with the intention of pointing people towards a bunch of other articles!
Let’s give the algorithm a try and see what we find. We can return a list of the available procedures by running the following query:
CALL gds.list("hits")
YIELD name, description
RETURN name, description;
name | description |
---|---|
"gds.alpha.hits.mutate" |
"Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm that rates nodes" |
"gds.alpha.hits.mutate.estimate" |
"Returns an estimation of the memory consumption for that procedure." |
"gds.alpha.hits.stats" |
"Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm that rates nodes" |
"gds.alpha.hits.stats.estimate" |
"Returns an estimation of the memory consumption for that procedure." |
"gds.alpha.hits.stream" |
"Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm that rates nodes" |
"gds.alpha.hits.stream.estimate" |
"Returns an estimation of the memory consumption for that procedure." |
"gds.alpha.hits.write" |
"Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm that rates nodes" |
"gds.alpha.hits.write.estimate" |
"Returns an estimation of the memory consumption for that procedure." |
Before we run the algorithm, we’ll create a projected graph called citation_graph
, by running the following:
CALL gds.graph.create("citation_graph", "Article", "CITED");
nodeProjection | relationshipProjection | graphName | nodeCount | relationshipCount | createMillis |
---|---|---|---|---|---|
{Article: {properties: {}, label: "Article"}} |
{CITED: {orientation: "NATURAL", aggregation: "DEFAULT", type: "CITED", properties: {}}} |
"citation_graph" |
51956 |
28706 |
149 |
And now we’ll run the write version of the algorithm against the projected graph:
CALL gds.alpha.hits.write("citation_graph", {
hitsIterations: 20
})
YIELD writeMillis, nodePropertiesWritten, ranIterations, postProcessingMillis, createMillis, computeMillis;
writeMillis | nodePropertiesWritten | ranIterations | postProcessingMillis | createMillis | computeMillis |
---|---|---|---|---|---|
174 |
103912 |
81 |
0 |
3 |
390 |
By default, this procedure will create pregel_auth
and pregel_hub
properties on each of the Article
nodes storing the computed scores.
Analysing authority scores
Let’s see which articles rank highest, starting with authority:
MATCH (a:Article)
RETURN a.title, a.year, substring(a.abstract, 0, 300) AS abstract,
[(a)-[:AUTHOR]->(auth) | auth.name] AS authors,
round(a.pregel_auth, 3) AS auth
ORDER BY auth DESC
LIMIT 10;
a.title | a.year | abstract | authors | auth |
---|---|---|---|---|
"Rough sets" |
1995 |
"Rough set theory, introduced by Zdzislaw Pawlak in the early 1980s [11, 12], is a new mathematical tool to deal with vagueness and uncertainty. This approach seems to be of fundamental importance to artificial intelligence (AI) and cognitive sciences, especially in the areas of machine learning, kno" |
["Jerzy W. Grzymala-Busse", "Wojciech Ziarko", "Zdzisław Pawlak", "Roman Słowiński"] |
0.99 |
"Fuzzy Similarity Relation as a Basis for Rough Approximations" |
1998 |
"The rough sets theory proposed by Pawlak was originally founded on the idea of approximating a given set by means of indiscernibility binary relation, which was assumed to be an equivalence relation (reflexive, symmetric and transitive). With respect to this basic idea, two main theoretical developm" |
["Roman Słowiński", "Salvatore Greco", "Benedetto Matarazzo"] |
0.042 |
"Toward Intelligent Systems: Calculi of Information Granules" |
2001 |
"We present an approach based on calculi of information granules as a basis for approximate reasoning in intelligent systems. Approximate reasoning schemes are defined by means of information granule construction schemes satisfying some robustness constraints. In distributed environments such schemes" |
["Andrzej Skowron"] |
0.042 |
"Approximation spaces and information granulation" |
2005 |
"In this paper, we discuss approximation spaces in a granular computing framework. Such approximation spaces generalise the approaches to concept approximation existing in rough set theory. Approximation spaces are constructed as higher level information granules and are obtained as the result of com" |
["Andrzej Skowron", "Piotr Synak", "Roman Świniarski"] |
0.038 |
"Layered learning for concept synthesis" |
2004 |
"We present a hierarchical scheme for synthesis of concept approximations based on given data and domain knowledge. We also propose a solution, founded on rough set theory, to the problem of con- structing the approximation of higher level concepts by composing the approximation of lower level concep" |
["Andrzej Skowron", "Jan G. Bazan", "Hung Son Nguyen", "Sinh Hoa Nguyen"] |
0.037 |
"A Comparison of Several Approaches to Missing Attribute Values in Data Mining" |
2000 |
"In the paper nine different approaches to missing attribute values are presented and compared. Ten input data files were used to investigate the performance of the nine methods to deal with missing attribute values. For testing both naive classification and new classification techniques of LERS (Lea" |
["Jerzy W. Grzymala-Busse", "Ming Hu"] |
0.036 |
"Variable Consistency Model of Dominance-Based Rough Sets Approach" |
2000 |
"Consideration of preference-orders requires the use of an extended rough set model called Dominance-based Rough Set Approach (DRSA). The rough approximations defined within DRSA are based on consistency in the sense of dominance principle. It requires that objects having not-worse evaluation with re" |
["Benedetto Matarazzo", "Salvatore Greco", "Roman Słowiński", "Jerzy Stefanowski"] |
0.029 |
"RSES and RSESlib - A Collection of Tools for Rough Set Computations" |
2000 |
"Rough Set Exploration System - a set of software tools featuring a library of methods and a graphical user interface is presented. Methods, features and abilities of the implemented software are discussed and illustrated with a case study in data analysis." |
["Marcin S. Szczuka", "Jan G. Bazan"] |
0.026 |
"A New Version of Rough Set Exploration System" |
2002 |
"We introduce a new version of the Rough Set Exploration System - a software tool featuring a library of methods and a graphical user interface supporting variety of rough-set-based computations. Methods, features and abilities of the implemented software are discussed and illustrated with a case stu" |
["Marcin S. Szczuka", "Jakub Wróblewski", "Jan G. Bazan"] |
0.026 |
"Rough sets and information granulation" |
2003 |
"In this paper, the study of the evolution of approximation space theory and its applications is considered in the context of rough sets introduced by Zdzislaw Pawlak and information granulation as well as computing with words formulated by Lotfi Zadeh. Central to this evolution is the rough-mereolog" |
["Piotr Synak", "James F. Peters", "Andrzej Skowron", "Sheela Ramanna"] |
0.026 |
The top article by some distance on this metric is "Rough sets", which was written more than 25 years ago. I found it interesting that the abstract talks about it being an approach that is fundamental to AI and machine learning, which are important fields in 2021.
We can have a look at the hub nodes that point to these articles by running the following query:
MATCH (a:Article)
WITH a, [(a)<-[:CITED]-(other) | other] AS citations
WITH a, apoc.coll.sortNodes(citations, "pregel_hub")[..5] AS topHubs
RETURN a.title, a.year,
round(a.pregel_auth, 3) AS auth,
[c in topHubs | {article: c.title, score: round(c.pregel_hub, 3)}] AS topHubs
ORDER BY auth DESC
LIMIT 10;
a.title | a.year | auth | topHubs |
---|---|---|---|
"Rough sets" |
1995 |
0.99 |
[{score: 0.083, article: "Rough ethology: towards a biologically-inspired study of collective behavior in intelligent systems with approximation spaces"}, {score: 0.082, article: "Some Issues on Rough Sets"}, {score: 0.079, article: "A treatise on rough sets"}, {score: 0.079, article: "Approximate boolean reasoning: foundations and applications in data mining"}, {score: 0.075, article: "Multimodal classification: case studies"}] |
"Fuzzy Similarity Relation as a Basis for Rough Approximations" |
1998 |
0.042 |
[{score: 0.082, article: "Some Issues on Rough Sets"}, {score: 0.079, article: "A treatise on rough sets"}, {score: 0.079, article: "Approximate boolean reasoning: foundations and applications in data mining"}, {score: 0.075, article: "On generalized rough fuzzy approximation operators"}, {score: 0.074, article: "Lattices with Interior and Closure Operators and Abstract Approximation Spaces"}] |
"Toward Intelligent Systems: Calculi of Information Granules" |
2001 |
0.042 |
[{score: 0.083, article: "Rough ethology: towards a biologically-inspired study of collective behavior in intelligent systems with approximation spaces"}, {score: 0.082, article: "Some Issues on Rough Sets"}, {score: 0.072, article: "Rough sets and information granulation"}, {score: 0.071, article: "A Note on Ziarko’s Variable Precision Rough Set Model and Nonmonotonic Reasoning"}, {score: 0.071, article: "A Partition Model of Granular Computing"}] |
"Approximation spaces and information granulation" |
2005 |
0.038 |
[{score: 0.083, article: "Rough ethology: towards a biologically-inspired study of collective behavior in intelligent systems with approximation spaces"}, {score: 0.082, article: "Some Issues on Rough Sets"}, {score: 0.079, article: "A treatise on rough sets"}, {score: 0.075, article: "On generalized rough fuzzy approximation operators"}, {score: 0.074, article: "Matching 2d image segments with genetic algorithms and approximation spaces"}] |
"Layered learning for concept synthesis" |
2004 |
0.037 |
[{score: 0.083, article: "Rough ethology: towards a biologically-inspired study of collective behavior in intelligent systems with approximation spaces"}, {score: 0.079, article: "A treatise on rough sets"}, {score: 0.079, article: "Approximate boolean reasoning: foundations and applications in data mining"}, {score: 0.075, article: "Multimodal classification: case studies"}, {score: 0.072, article: "P300 wave detection based on rough sets"}] |
"A Comparison of Several Approaches to Missing Attribute Values in Data Mining" |
2000 |
0.036 |
[{score: 0.082, article: "Some Issues on Rough Sets"}, {score: 0.075, article: "The rough set exploration system"}, {score: 0.071, article: "Missing template decomposition method and its implementation in rough set exploration system"}, {score: 0.07, article: "Data with Missing Attribute Values: Generalization of Indiscernibility Relation and Rule Induction"}, {score: 0.07, article: "Characteristic relations for incomplete data: a generalization of the indiscernibility relation"}] |
"Variable Consistency Model of Dominance-Based Rough Sets Approach" |
2000 |
0.029 |
[{score: 0.072, article: "Rough Set Analysis of Preference-Ordered Data"}, {score: 0.072, article: "Variable-precision dominance-based rough set approach"}, {score: 0.071, article: "On variable consistency dominance-based rough set approaches"}, {score: 0.071, article: "Multicriteria choice and ranking using decision rules induced from rough approximation of graded preference relations"}, {score: 0.07, article: "Rough set approach to customer satisfaction analysis"}] |
"RSES and RSESlib - A Collection of Tools for Rough Set Computations" |
2000 |
0.026 |
[{score: 0.079, article: "Approximate boolean reasoning: foundations and applications in data mining"}, {score: 0.073, article: "Hybridization of rough sets and statistical learning theory"}, {score: 0.072, article: "Ontology driven concept approximation"}, {score: 0.072, article: "Processing of musical data employing rough sets and artificial neural networks"}, {score: 0.069, article: "A statistical method for determining importance of variables in an information system"}] |
"A New Version of Rough Set Exploration System" |
2002 |
0.026 |
[{score: 0.075, article: "Multimodal classification: case studies"}, {score: 0.072, article: "Processing of musical data employing rough sets and artificial neural networks"}, {score: 0.069, article: "Introducing a rule importance measure"}, {score: 0.069, article: "NetTRS induction and postprocessing of decision rules"}, {score: 0.069, article: "Classification of Swallowing Sound Signals: A Rough Set Approach"}] |
"Rough sets and information granulation" |
2003 |
0.026 |
[{score: 0.083, article: "Rough ethology: towards a biologically-inspired study of collective behavior in intelligent systems with approximation spaces"}, {score: 0.079, article: "A treatise on rough sets"}, {score: 0.075, article: "On generalized rough fuzzy approximation operators"}, {score: 0.074, article: "Matching 2d image segments with genetic algorithms and approximation spaces"}, {score: 0.071, article: "Time complexity of decision trees"}] |
Based on the top hubs, it’s not really obvious why the authority score for "Rough sets" is so much higher than the other articles. Perhaps if we return the max, min, and average hub scores we’ll be able to figure it out?
MATCH (a:Article)
WITH a, [(a)<-[:CITED]-(other) | other] AS citations
RETURN a.title, a.year,
round(a.pregel_auth, 3) AS auth,
round(apoc.coll.max([c in citations | c.pregel_hub]), 3) AS maxHub,
round(apoc.coll.min([c in citations | c.pregel_hub]), 3) AS minHub,
round(apoc.coll.avg([c in citations | c.pregel_hub]), 3) AS averageHub,
size(citations) AS citations
ORDER BY auth DESC
LIMIT 10;
a.title | a.year | auth | maxHub | minHub | averageHub | citations |
---|---|---|---|---|---|---|
"Rough sets" |
1995 |
0.99 |
0.083 |
0.068 |
0.069 |
211 |
"Toward Intelligent Systems: Calculi of Information Granules" |
2001 |
0.042 |
0.083 |
0.003 |
0.036 |
17 |
"Fuzzy Similarity Relation as a Basis for Rough Approximations" |
1998 |
0.042 |
0.082 |
0.003 |
0.061 |
10 |
"Approximation spaces and information granulation" |
2005 |
0.038 |
0.083 |
0.005 |
0.055 |
10 |
"Layered learning for concept synthesis" |
2004 |
0.037 |
0.083 |
0.003 |
0.05 |
11 |
"A Comparison of Several Approaches to Missing Attribute Values in Data Mining" |
2000 |
0.036 |
0.082 |
0.002 |
0.048 |
11 |
"Variable Consistency Model of Dominance-Based Rough Sets Approach" |
2000 |
0.029 |
0.072 |
0.004 |
0.061 |
7 |
"RSES and RSESlib - A Collection of Tools for Rough Set Computations" |
2000 |
0.026 |
0.079 |
0.002 |
0.032 |
12 |
"Rough sets and information granulation" |
2003 |
0.026 |
0.083 |
0.005 |
0.065 |
6 |
"A New Version of Rough Set Exploration System" |
2002 |
0.026 |
0.075 |
0.002 |
0.038 |
10 |
From this output we learn that "Rough sets" is being cited by a lot of articles with a good hub score.
The other articles have a similar maxHub
score and some even have a similar averageHub
, but their minHub
is significantly less.
It also has 10x as many citations as any of the other articles in the top 10, so that would contribute to the higher score as well.
HITS Authority vs PageRank
The HITS Authority score and the PageRank algorithm both compute scores that indicate the importance of a node in a graph, so I was curious whether there was any correlation between the scores. i.e. do the nodes with the highest HITS authority score also have a high PageRank score?
To recap, this is what PageRank measures:
The PageRank algorithm measures the importance of each node within the graph, based on the number incoming relationships and the importance of the corresponding source nodes.
We can compute the PageRank score for articles, by running the following query:
CALL gds.pageRank.write("citation_graph", {
maxIterations: 20,
writeProperty: "pagerank"
})
YIELD writeMillis, nodePropertiesWritten, ranIterations, postProcessingMillis, createMillis, computeMillis;
writeMillis | nodePropertiesWritten | ranIterations | postProcessingMillis | createMillis | computeMillis |
---|---|---|---|---|---|
29 |
51956 |
20 |
0 |
0 |
112 |
And now let’s put the PageRank scores alongside the HITS Authority scores:
MATCH (a:Article)
RETURN a.title, a.year,
round(a.pregel_auth, 3) AS auth,
round(a.pagerank, 3) AS pagerank,
size([(a)<-[:CITED]-(other) | other]) AS citations
ORDER BY auth DESC
LIMIT 10;
a.title | a.year | auth | pagerank | citations |
---|---|---|---|---|
"Rough sets" |
1995 |
0.99 |
25.609 |
211 |
"Fuzzy Similarity Relation as a Basis for Rough Approximations" |
1998 |
0.042 |
0.738 |
10 |
"Toward Intelligent Systems: Calculi of Information Granules" |
2001 |
0.042 |
1.862 |
17 |
"Approximation spaces and information granulation" |
2005 |
0.038 |
0.418 |
10 |
"Layered learning for concept synthesis" |
2004 |
0.037 |
0.505 |
11 |
"A Comparison of Several Approaches to Missing Attribute Values in Data Mining" |
2000 |
0.036 |
0.896 |
11 |
"Variable Consistency Model of Dominance-Based Rough Sets Approach" |
2000 |
0.029 |
0.471 |
7 |
"RSES and RSESlib - A Collection of Tools for Rough Set Computations" |
2000 |
0.026 |
1.296 |
12 |
"A New Version of Rough Set Exploration System" |
2002 |
0.026 |
0.682 |
10 |
"Rough sets and information granulation" |
2003 |
0.026 |
0.375 |
6 |
Rough Sets is the only one with a high PageRank score as well. In fact, its PageRank score is the 3rd highest in the graph, which we can see by running the following query:
MATCH (a:Article)
RETURN a.title, a.year,
round(a.pregel_auth, 5) AS auth,
round(a.pagerank, 5) AS pagerank,
size([(a)<-[:CITED]-(other) | other]) AS citations
ORDER BY pagerank DESC
LIMIT 10;
a.title | a.year | auth | pagerank | citations |
---|---|---|---|---|
"A method for obtaining digital signatures and public-key cryptosystems" |
1978 |
5.0E-5 |
93.94313 |
125 |
"Secure communications over insecure channels" |
1978 |
0.0 |
79.86924 |
7 |
"Rough sets" |
1995 |
0.9902 |
25.60911 |
211 |
"An axiomatic basis for computer programming" |
1969 |
4.4E-4 |
23.02937 |
93 |
"Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems" |
2001 |
0.0 |
21.46956 |
108 |
"SCRIBE: The Design of a Large-Scale Event Notification Infrastructure" |
2001 |
0.0 |
19.4863 |
14 |
"A field study of the software design process for large systems" |
1988 |
0.0 |
19.02815 |
53 |
"Productivity factors and programming environments" |
1984 |
0.0 |
18.49935 |
5 |
"Analyzing medium-scale software development" |
1978 |
0.0 |
16.45275 |
5 |
"A Calculus of Communicating Systems" |
1982 |
0.0 |
15.43059 |
55 |
I find it kinda interesting that while these articles have very high transitive importance, their HITS Authority score is very low. Many of them have a lot of citations as well, but presumably most of those citations aren’t from hub nodes.
Analysing hub scores
Speaking of hubs, let’s explore those in a bit more detail. We can find the articles with the highest hub score, by running the following query:
MATCH (a:Article)
WITH a, [(a)-[:CITED]->(other) | other] AS cited
RETURN a.title, a.year,
round(a.pregel_hub, 3) AS hub,
round(apoc.coll.max([c in cited | c.pregel_auth]), 3) AS maxAuth,
round(apoc.coll.min([c in cited | c.pregel_auth]), 3) AS minAuth,
round(apoc.coll.avg([c in cited | c.pregel_auth]), 3) AS averageAuth,
size(cited) AS cited
ORDER BY a.pregel_hub DESC
LIMIT 10;
a.title | a.year | hub | maxAuth | minAuth | averageAuth | cited |
---|---|---|---|---|---|---|
"Rough ethology: towards a biologically-inspired study of collective behavior in intelligent systems with approximation spaces" |
2005 |
0.083 |
0.99 |
0.006 |
0.102 |
12 |
"Some Issues on Rough Sets" |
2004 |
0.082 |
0.99 |
0.006 |
0.134 |
9 |
"A treatise on rough sets" |
2005 |
0.079 |
0.99 |
0.005 |
0.145 |
8 |
"Approximate boolean reasoning: foundations and applications in data mining" |
2006 |
0.079 |
0.99 |
0.005 |
0.115 |
10 |
"Multimodal classification: case studies" |
2006 |
0.075 |
0.99 |
0.005 |
0.122 |
9 |
"The rough set exploration system" |
2005 |
0.075 |
0.99 |
0.005 |
0.157 |
7 |
"On generalized rough fuzzy approximation operators" |
2006 |
0.075 |
0.99 |
0.026 |
0.274 |
4 |
"Lattices with Interior and Closure Operators and Abstract Approximation Spaces" |
2009 |
0.074 |
0.99 |
0.005 |
0.136 |
8 |
"Matching 2d image segments with genetic algorithms and approximation spaces" |
2006 |
0.074 |
0.99 |
0.005 |
0.154 |
7 |
"Hybridization of rough sets and statistical learning theory" |
2011 |
0.073 |
0.99 |
0.005 |
0.214 |
5 |
The maxAuth
scores tell us that all of these articles cite the "Rough sets" article that we came across in the previous section.
There aren’t really any other articles with a high authority score, so we can assume that nearly all of the hub score is coming from citing "Rough sets".
In any case, let’s have a look at the other authorities that these articles have cited:
MATCH (a:Article)
WITH a, [(a)-[:CITED]->(other) | other] AS cited
WITH a, apoc.coll.sortNodes(cited, "pregel_auth")[..5] AS topAuthorities
RETURN a.title, a.year,
round(a.pregel_hub, 3) AS hub,
[c in topAuthorities | {article: c.title, score: round(c.pregel_auth, 3)}] AS topAuthorities
ORDER BY hub DESC
LIMIT 10;
a.title | a.year | hub | topAuthorities |
---|---|---|---|
"Rough ethology: towards a biologically-inspired study of collective behavior in intelligent systems with approximation spaces" |
2005 |
0.083 |
[{score: 0.99, article: "Rough sets"}, {score: 0.042, article: "Toward Intelligent Systems: Calculi of Information Granules"}, {score: 0.038, article: "Approximation spaces and information granulation"}, {score: 0.037, article: "Layered learning for concept synthesis"}, {score: 0.026, article: "Rough sets and information granulation"}] |
"Some Issues on Rough Sets" |
2004 |
0.082 |
[{score: 0.99, article: "Rough sets"}, {score: 0.042, article: "Toward Intelligent Systems: Calculi of Information Granules"}, {score: 0.042, article: "Fuzzy Similarity Relation as a Basis for Rough Approximations"}, {score: 0.038, article: "Approximation spaces and information granulation"}, {score: 0.036, article: "A Comparison of Several Approaches to Missing Attribute Values in Data Mining"}] |
"A treatise on rough sets" |
2005 |
0.079 |
[{score: 0.99, article: "Rough sets"}, {score: 0.042, article: "Fuzzy Similarity Relation as a Basis for Rough Approximations"}, {score: 0.038, article: "Approximation spaces and information granulation"}, {score: 0.037, article: "Layered learning for concept synthesis"}, {score: 0.026, article: "Rough sets and information granulation"}] |
"Approximate boolean reasoning: foundations and applications in data mining" |
2006 |
0.079 |
[{score: 0.99, article: "Rough sets"}, {score: 0.042, article: "Fuzzy Similarity Relation as a Basis for Rough Approximations"}, {score: 0.037, article: "Layered learning for concept synthesis"}, {score: 0.026, article: "RSES and RSESlib - A Collection of Tools for Rough Set Computations"}, {score: 0.021, article: "Some Issues on Rough Sets"}] |
"Multimodal classification: case studies" |
2006 |
0.075 |
[{score: 0.99, article: "Rough sets"}, {score: 0.037, article: "Layered learning for concept synthesis"}, {score: 0.026, article: "A New Version of Rough Set Exploration System"}, {score: 0.015, article: "The rough set exploration system"}, {score: 0.01, article: "Rough Set Methods in Approximation of Hierarchical Concepts"}] |
"The rough set exploration system" |
2005 |
0.075 |
[{score: 0.99, article: "Rough sets"}, {score: 0.036, article: "A Comparison of Several Approaches to Missing Attribute Values in Data Mining"}, {score: 0.021, article: "Rough Sets and Decision Algorithms"}, {score: 0.021, article: "In Pursuit of Patterns in Data Reasoning from Data The Rough Set Way"}, {score: 0.015, article: "Classification of Swallowing Sound Signals: A Rough Set Approach"}] |
"On generalized rough fuzzy approximation operators" |
2006 |
0.075 |
[{score: 0.99, article: "Rough sets"}, {score: 0.042, article: "Fuzzy Similarity Relation as a Basis for Rough Approximations"}, {score: 0.038, article: "Approximation spaces and information granulation"}, {score: 0.026, article: "Rough sets and information granulation"}] |
"Lattices with Interior and Closure Operators and Abstract Approximation Spaces" |
2009 |
0.074 |
[{score: 0.99, article: "Rough sets"}, {score: 0.042, article: "Fuzzy Similarity Relation as a Basis for Rough Approximations"}, {score: 0.024, article: "Approximation Operators in Qualitative Data Analysis"}, {score: 0.015, article: "Data with Missing Attribute Values: Generalization of Indiscernibility Relation and Rule Induction"}, {score: 0.005, article: "Algebraic structures for rough sets"}] |
"Matching 2d image segments with genetic algorithms and approximation spaces" |
2006 |
0.074 |
[{score: 0.99, article: "Rough sets"}, {score: 0.038, article: "Approximation spaces and information granulation"}, {score: 0.026, article: "Rough sets and information granulation"}, {score: 0.01, article: "K-means Indiscernibility Relation over Pixels"}, {score: 0.006, article: "Rough ethology: towards a biologically-inspired study of collective behavior in intelligent systems with approximation spaces"}] |
"Hybridization of rough sets and statistical learning theory" |
2011 |
0.073 |
[{score: 0.99, article: "Rough sets"}, {score: 0.038, article: "Approximation spaces and information granulation"}, {score: 0.026, article: "RSES and RSESlib - A Collection of Tools for Rough Set Computations"}, {score: 0.01, article: "Accuracy and Coverage in Rough Set Rule Induction"}, {score: 0.005, article: "Generalized indiscernibility relations: applications for missing values and analysis of structural objects"}] |
The top 2 articles both cited "Toward Intelligent Systems: Calculi of Information Granules", which gives them a marginally higher score than the other 8. But I don’t think these hub scores are telling us all that much about these articles.
In Summary
While I’m not sure that this is the greatest data set to show off this algorithm, I think the algorithm itself is an interesting addition to the library. I’m curious to see how well it would fare on a Twitter graph - perhaps the HITS Hub score would help to identify those accounts that primarily tweet out links to interesting content? I guess that exploration will have to wait for another post!
About the author
I'm currently working on short form content at ClickHouse. I publish short 5 minute videos showing how to solve data problems on YouTube @LearnDataWithMark. I previously worked on graph analytics at Neo4j, where I also co-authored the O'Reilly Graph Algorithms Book with Amy Hodler.