· neo4j graph-data-science graph-algorithms

Neo4j Graph Data Science 1.5: Exploring the HITS Algorithm

The Neo4j Graph Data Science Library provides efficiently implemented, parallel versions of common graph algorithms for Neo4j, exposed as Cypher procedures. It recently published version 1.5, which has lots of goodies to play with.

gds v1.5 hits

In this blog post, we’re going to explore the newly added HITS algorithm with the help of a citations dataset.

Launching Neo4j

We’re going to run Neo4j with the Graph Data Science Library using the following Docker Compose configuration:

docker-compose.yml
version: '3.7'
services:
  neo4j:
    image: neo4j:4.2.3-enterprise
    container_name: "neo4j4.2-gds1.5-exploration"
    volumes:
      - ./plugins-4.2:/plugins
      - ./data-4.2:/data
    ports:
      - "7474:7474"
      - "7687:7687"
    environment:
      - NEO4J_ACCEPT_LICENSE_AGREEMENT=yes
      - NEO4J_AUTH=neo4j/neo
      - NEO4JLABS_PLUGINS=["apoc", "graph-data-science"]

If you want to follow along with the examples used in the blog post, you can copy the configuration above to a file titled docker-compose.yml We can now launch the Neo4j server by running the following command:

docker-compose up

After we’ve launched that command, we need to wait until we see the following output:

Output
neo4j4.2-gds1.5-exploration | 2021-02-03 21:39:15.346+0000 INFO  Bolt enabled on 0.0.0.0:7687.
neo4j4.2-gds1.5-exploration | 2021-02-03 21:39:16.053+0000 INFO  Remote interface available at http://localhost:7474/
neo4j4.2-gds1.5-exploration | 2021-02-03 21:39:16.053+0000 INFO  Started.

Importing the Citation Graph

We’re going to import the Citations Graph from the Using a Machine Learning Workflow for Link Prediction online training.

Let’s first connect to Neo4j using the Cypher Shell:

docker exec -it neo4j4.2-gds1.5-exploration bin/cypher-shell -u neo4j -p neo
Output
Connected to Neo4j using Bolt protocol version 4.2 at neo4j://localhost:7687 as user neo4j.
Type :help for a list of available commands or :exit to exit the shell.
Note that Cypher queries must end with a semicolon.
neo4j@neo4j>

The data for the citation graph is available as a set of JSON lines files, which we can find at https://github.com/mneedham/link-prediction/tree/master/data. Below is an example of one line of one of the files:

{"authors": ["Tegegne Marew", "Doo-Hwan Bae"], "n_citation": 1, "references": ["2134bf3b-fd89-4724-90ce-5993b4fa3218", "906c17e0-db09-407b-b760-41df5a3f0293", "94f4382e-cfa6-4aec-92b8-3711fc55da54", "9f172585-8d42-4fce-b6ae-aede321f3fd4", "a3aee287-efd0-4b9d-9cda-d47dd192c9f4", "a9a7fd07-ef71-4b3c-8fcf-d7fe114d2148", "d63dd4ae-4b30-484b-8ffc-88d21839ddad"], "title": "Using Classpects for Integrating Non-Functional and Functional Requirements.", "venue": "international conference on software engineering", "year": 2006, "id": "01f1d231-80ae-4cce-b56c-9d821e0924d0"}

We’re going to use some APOC procedures to convert these JSON files into the following graph structure:

graph
Figure 1. Graph Model

So we’ll have the following node labels:

  • Article - a paper on a topic

  • Venue - where the paper was presented

  • Author - the person/people that worked on the Article

And these relationship types:

  • CITED - which articles were cited by an article

  • AUTHOR - which articles were written by a person/people

  • VENUE - where an article was presented

Let’s first create a new database and setup some constraints to make sure we don’t end up with duplicates:

Create new database and setup constraints
CREATE OR REPLACE DATABASE citationsblogpost;
:use citationsblogpost;

CREATE CONSTRAINT ON (a:Article) ASSERT a.index IS UNIQUE;
CREATE CONSTRAINT ON (a:Author) ASSERT a.name IS UNIQUE;
CREATE CONSTRAINT ON (v:Venue) ASSERT v.name IS UNIQUE;

Now we’ll import the data:

Create citations graph
CALL apoc.periodic.iterate(
  'UNWIND ["dblp-ref-0.json", "dblp-ref-1.json", "dblp-ref-2.json", "dblp-ref-3.json"] AS file
   CALL apoc.load.json("https://github.com/mneedham/link-prediction/raw/master/data/" + file)
   YIELD value WITH value
   return value',
  'MERGE (a:Article {index:value.id})
   SET a += apoc.map.clean(value,["id","authors","references", "venue"],[0])
   WITH a, value.authors as authors, value.references AS citations, value.venue AS venue
   MERGE (v:Venue {name: venue})
   MERGE (a)-[:VENUE]->(v)
   FOREACH(author in authors |
     MERGE (b:Author{name:author})
     MERGE (a)-[:AUTHOR]->(b))
   FOREACH(citation in citations |
     MERGE (cited:Article {index:citation})
     MERGE (a)-[:CITED]->(cited))',
   {batchSize: 1000, iterateList: true})
YIELD batches, total, timeTaken, committedOperations;
Table 1. Results
batches total timeTaken committedOperations

52

51956

21

51956

And finally, a bit of cleanup to remove articles that don’t have a title:

Remove articles that don’t have a title
MATCH (a:Article)
WHERE not(exists(a.title))
DETACH DELETE a;

HITS Algorithm

The HITs algorithm, like many other graph algorithms, was invented to do link analysis on web pages. It is a centrality algorithm, which means that it indicates node importance based on some metric. We can learn more about it from the HITS Wikipedia page:

The idea behind Hubs and Authorities stemmed from a particular insight into the creation of web pages when the Internet was originally forming; that is, certain web pages, known as hubs, served as large directories that were not actually authoritative in the information that they held, but were used as compilations of a broad catalog of information that led users direct to other authoritative pages.

The scheme therefore assigns two scores for each page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages.

So a page with a high authority score has high value content, whereas a page with a high hub score links out to important pages.

We’re going to use this algorithm to analyse the citations between articles in our graph, so what does those different scores mean for us?

  • An article with a high authority score will likely have a lot of citations, perhaps some of those by other important articles

  • An article with a high hub score can help direct us (via its citations) to the important articles. It’s not clear to me that the hub score makes so much sense in this graph because there aren’t really articles written with the intention of pointing people towards a bunch of other articles!

Let’s give the algorithm a try and see what we find. We can return a list of the available procedures by running the following query:

List the HITS procedures
CALL gds.list("hits")
YIELD name, description
RETURN name, description;
Table 2. Results
name description

"gds.alpha.hits.mutate"

"Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm that rates nodes"

"gds.alpha.hits.mutate.estimate"

"Returns an estimation of the memory consumption for that procedure."

"gds.alpha.hits.stats"

"Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm that rates nodes"

"gds.alpha.hits.stats.estimate"

"Returns an estimation of the memory consumption for that procedure."

"gds.alpha.hits.stream"

"Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm that rates nodes"

"gds.alpha.hits.stream.estimate"

"Returns an estimation of the memory consumption for that procedure."

"gds.alpha.hits.write"

"Hyperlink-Induced Topic Search (HITS) is a link analysis algorithm that rates nodes"

"gds.alpha.hits.write.estimate"

"Returns an estimation of the memory consumption for that procedure."

Before we run the algorithm, we’ll create a projected graph called citation_graph, by running the following:

Create projected graph
CALL gds.graph.create("citation_graph", "Article", "CITED");
Table 3. Results
nodeProjection relationshipProjection graphName nodeCount relationshipCount createMillis

{Article: {properties: {}, label: "Article"}}

{CITED: {orientation: "NATURAL", aggregation: "DEFAULT", type: "CITED", properties: {}}}

"citation_graph"

51956

28706

149

And now we’ll run the write version of the algorithm against the projected graph:

Run HITS algorithm
CALL gds.alpha.hits.write("citation_graph", {
  hitsIterations: 20
})
YIELD writeMillis, nodePropertiesWritten, ranIterations, postProcessingMillis, createMillis, computeMillis;
Table 4. Results
writeMillis nodePropertiesWritten ranIterations postProcessingMillis createMillis computeMillis

174

103912

81

0

3

390

By default, this procedure will create pregel_auth and pregel_hub properties on each of the Article nodes storing the computed scores.

Analysing authority scores

Let’s see which articles rank highest, starting with authority:

MATCH (a:Article)
RETURN a.title, a.year, substring(a.abstract, 0, 300) AS abstract,
       [(a)-[:AUTHOR]->(auth) | auth.name] AS authors,
       round(a.pregel_auth, 3) AS auth
ORDER BY auth DESC
LIMIT 10;
Table 5. Results
a.title a.year abstract authors auth

"Rough sets"

1995

"Rough set theory, introduced by Zdzislaw Pawlak in the early 1980s [11, 12], is a new mathematical tool to deal with vagueness and uncertainty. This approach seems to be of fundamental importance to artificial intelligence (AI) and cognitive sciences, especially in the areas of machine learning, kno"

["Jerzy W. Grzymala-Busse", "Wojciech Ziarko", "Zdzisław Pawlak", "Roman Słowiński"]

0.99

"Fuzzy Similarity Relation as a Basis for Rough Approximations"

1998

"The rough sets theory proposed by Pawlak was originally founded on the idea of approximating a given set by means of indiscernibility binary relation, which was assumed to be an equivalence relation (reflexive, symmetric and transitive). With respect to this basic idea, two main theoretical developm"

["Roman Słowiński", "Salvatore Greco", "Benedetto Matarazzo"]

0.042

"Toward Intelligent Systems: Calculi of Information Granules"

2001

"We present an approach based on calculi of information granules as a basis for approximate reasoning in intelligent systems. Approximate reasoning schemes are defined by means of information granule construction schemes satisfying some robustness constraints. In distributed environments such schemes"

["Andrzej Skowron"]

0.042

"Approximation spaces and information granulation"

2005

"In this paper, we discuss approximation spaces in a granular computing framework. Such approximation spaces generalise the approaches to concept approximation existing in rough set theory. Approximation spaces are constructed as higher level information granules and are obtained as the result of com"

["Andrzej Skowron", "Piotr Synak", "Roman Świniarski"]

0.038

"Layered learning for concept synthesis"

2004

"We present a hierarchical scheme for synthesis of concept approximations based on given data and domain knowledge. We also propose a solution, founded on rough set theory, to the problem of con- structing the approximation of higher level concepts by composing the approximation of lower level concep"

["Andrzej Skowron", "Jan G. Bazan", "Hung Son Nguyen", "Sinh Hoa Nguyen"]

0.037

"A Comparison of Several Approaches to Missing Attribute Values in Data Mining"

2000

"In the paper nine different approaches to missing attribute values are presented and compared. Ten input data files were used to investigate the performance of the nine methods to deal with missing attribute values. For testing both naive classification and new classification techniques of LERS (Lea"

["Jerzy W. Grzymala-Busse", "Ming Hu"]

0.036

"Variable Consistency Model of Dominance-Based Rough Sets Approach"

2000

"Consideration of preference-orders requires the use of an extended rough set model called Dominance-based Rough Set Approach (DRSA). The rough approximations defined within DRSA are based on consistency in the sense of dominance principle. It requires that objects having not-worse evaluation with re"

["Benedetto Matarazzo", "Salvatore Greco", "Roman Słowiński", "Jerzy Stefanowski"]

0.029

"RSES and RSESlib - A Collection of Tools for Rough Set Computations"

2000

"Rough Set Exploration System - a set of software tools featuring a library of methods and a graphical user interface is presented. Methods, features and abilities of the implemented software are discussed and illustrated with a case study in data analysis."

["Marcin S. Szczuka", "Jan G. Bazan"]

0.026

"A New Version of Rough Set Exploration System"

2002

"We introduce a new version of the Rough Set Exploration System - a software tool featuring a library of methods and a graphical user interface supporting variety of rough-set-based computations. Methods, features and abilities of the implemented software are discussed and illustrated with a case stu"

["Marcin S. Szczuka", "Jakub Wróblewski", "Jan G. Bazan"]

0.026

"Rough sets and information granulation"

2003

"In this paper, the study of the evolution of approximation space theory and its applications is considered in the context of rough sets introduced by Zdzislaw Pawlak and information granulation as well as computing with words formulated by Lotfi Zadeh. Central to this evolution is the rough-mereolog"

["Piotr Synak", "James F. Peters", "Andrzej Skowron", "Sheela Ramanna"]

0.026

The top article by some distance on this metric is "Rough sets", which was written more than 25 years ago. I found it interesting that the abstract talks about it being an approach that is fundamental to AI and machine learning, which are important fields in 2021.

We can have a look at the hub nodes that point to these articles by running the following query:

MATCH (a:Article)
WITH a, [(a)<-[:CITED]-(other) | other] AS citations
WITH a, apoc.coll.sortNodes(citations, "pregel_hub")[..5] AS topHubs
RETURN a.title, a.year,
       round(a.pregel_auth, 3) AS auth,
       [c in topHubs | {article: c.title, score: round(c.pregel_hub, 3)}] AS topHubs
ORDER BY auth DESC
LIMIT 10;
Table 6. Results
a.title a.year auth topHubs

"Rough sets"

1995

0.99

[{score: 0.083, article: "Rough ethology: towards a biologically-inspired study of collective behavior in intelligent systems with approximation spaces"}, {score: 0.082, article: "Some Issues on Rough Sets"}, {score: 0.079, article: "A treatise on rough sets"}, {score: 0.079, article: "Approximate boolean reasoning: foundations and applications in data mining"}, {score: 0.075, article: "Multimodal classification: case studies"}]

"Fuzzy Similarity Relation as a Basis for Rough Approximations"

1998

0.042

[{score: 0.082, article: "Some Issues on Rough Sets"}, {score: 0.079, article: "A treatise on rough sets"}, {score: 0.079, article: "Approximate boolean reasoning: foundations and applications in data mining"}, {score: 0.075, article: "On generalized rough fuzzy approximation operators"}, {score: 0.074, article: "Lattices with Interior and Closure Operators and Abstract Approximation Spaces"}]

"Toward Intelligent Systems: Calculi of Information Granules"

2001

0.042

[{score: 0.083, article: "Rough ethology: towards a biologically-inspired study of collective behavior in intelligent systems with approximation spaces"}, {score: 0.082, article: "Some Issues on Rough Sets"}, {score: 0.072, article: "Rough sets and information granulation"}, {score: 0.071, article: "A Note on Ziarko’s Variable Precision Rough Set Model and Nonmonotonic Reasoning"}, {score: 0.071, article: "A Partition Model of Granular Computing"}]

"Approximation spaces and information granulation"

2005

0.038

[{score: 0.083, article: "Rough ethology: towards a biologically-inspired study of collective behavior in intelligent systems with approximation spaces"}, {score: 0.082, article: "Some Issues on Rough Sets"}, {score: 0.079, article: "A treatise on rough sets"}, {score: 0.075, article: "On generalized rough fuzzy approximation operators"}, {score: 0.074, article: "Matching 2d image segments with genetic algorithms and approximation spaces"}]

"Layered learning for concept synthesis"

2004

0.037

[{score: 0.083, article: "Rough ethology: towards a biologically-inspired study of collective behavior in intelligent systems with approximation spaces"}, {score: 0.079, article: "A treatise on rough sets"}, {score: 0.079, article: "Approximate boolean reasoning: foundations and applications in data mining"}, {score: 0.075, article: "Multimodal classification: case studies"}, {score: 0.072, article: "P300 wave detection based on rough sets"}]

"A Comparison of Several Approaches to Missing Attribute Values in Data Mining"

2000

0.036

[{score: 0.082, article: "Some Issues on Rough Sets"}, {score: 0.075, article: "The rough set exploration system"}, {score: 0.071, article: "Missing template decomposition method and its implementation in rough set exploration system"}, {score: 0.07, article: "Data with Missing Attribute Values: Generalization of Indiscernibility Relation and Rule Induction"}, {score: 0.07, article: "Characteristic relations for incomplete data: a generalization of the indiscernibility relation"}]

"Variable Consistency Model of Dominance-Based Rough Sets Approach"

2000

0.029

[{score: 0.072, article: "Rough Set Analysis of Preference-Ordered Data"}, {score: 0.072, article: "Variable-precision dominance-based rough set approach"}, {score: 0.071, article: "On variable consistency dominance-based rough set approaches"}, {score: 0.071, article: "Multicriteria choice and ranking using decision rules induced from rough approximation of graded preference relations"}, {score: 0.07, article: "Rough set approach to customer satisfaction analysis"}]

"RSES and RSESlib - A Collection of Tools for Rough Set Computations"

2000

0.026

[{score: 0.079, article: "Approximate boolean reasoning: foundations and applications in data mining"}, {score: 0.073, article: "Hybridization of rough sets and statistical learning theory"}, {score: 0.072, article: "Ontology driven concept approximation"}, {score: 0.072, article: "Processing of musical data employing rough sets and artificial neural networks"}, {score: 0.069, article: "A statistical method for determining importance of variables in an information system"}]

"A New Version of Rough Set Exploration System"

2002

0.026

[{score: 0.075, article: "Multimodal classification: case studies"}, {score: 0.072, article: "Processing of musical data employing rough sets and artificial neural networks"}, {score: 0.069, article: "Introducing a rule importance measure"}, {score: 0.069, article: "NetTRS induction and postprocessing of decision rules"}, {score: 0.069, article: "Classification of Swallowing Sound Signals: A Rough Set Approach"}]

"Rough sets and information granulation"

2003

0.026

[{score: 0.083, article: "Rough ethology: towards a biologically-inspired study of collective behavior in intelligent systems with approximation spaces"}, {score: 0.079, article: "A treatise on rough sets"}, {score: 0.075, article: "On generalized rough fuzzy approximation operators"}, {score: 0.074, article: "Matching 2d image segments with genetic algorithms and approximation spaces"}, {score: 0.071, article: "Time complexity of decision trees"}]

Based on the top hubs, it’s not really obvious why the authority score for "Rough sets" is so much higher than the other articles. Perhaps if we return the max, min, and average hub scores we’ll be able to figure it out?

MATCH (a:Article)
WITH a, [(a)<-[:CITED]-(other) | other] AS citations
RETURN a.title, a.year,
       round(a.pregel_auth, 3) AS auth,
       round(apoc.coll.max([c in citations | c.pregel_hub]), 3) AS maxHub,
       round(apoc.coll.min([c in citations | c.pregel_hub]), 3) AS minHub,
       round(apoc.coll.avg([c in citations | c.pregel_hub]), 3) AS averageHub,
       size(citations) AS citations
ORDER BY auth DESC
LIMIT 10;
Table 7. Results
a.title a.year auth maxHub minHub averageHub citations

"Rough sets"

1995

0.99

0.083

0.068

0.069

211

"Toward Intelligent Systems: Calculi of Information Granules"

2001

0.042

0.083

0.003

0.036

17

"Fuzzy Similarity Relation as a Basis for Rough Approximations"

1998

0.042

0.082

0.003

0.061

10

"Approximation spaces and information granulation"

2005

0.038

0.083

0.005

0.055

10

"Layered learning for concept synthesis"

2004

0.037

0.083

0.003

0.05

11

"A Comparison of Several Approaches to Missing Attribute Values in Data Mining"

2000

0.036

0.082

0.002

0.048

11

"Variable Consistency Model of Dominance-Based Rough Sets Approach"

2000

0.029

0.072

0.004

0.061

7

"RSES and RSESlib - A Collection of Tools for Rough Set Computations"

2000

0.026

0.079

0.002

0.032

12

"Rough sets and information granulation"

2003

0.026

0.083

0.005

0.065

6

"A New Version of Rough Set Exploration System"

2002

0.026

0.075

0.002

0.038

10

From this output we learn that "Rough sets" is being cited by a lot of articles with a good hub score. The other articles have a similar maxHub score and some even have a similar averageHub, but their minHub is significantly less. It also has 10x as many citations as any of the other articles in the top 10, so that would contribute to the higher score as well.

HITS Authority vs PageRank

The HITS Authority score and the PageRank algorithm both compute scores that indicate the importance of a node in a graph, so I was curious whether there was any correlation between the scores. i.e. do the nodes with the highest HITS authority score also have a high PageRank score?

The PageRank algorithm measures the importance of each node within the graph, based on the number incoming relationships and the importance of the corresponding source nodes.

We can compute the PageRank score for articles, by running the following query:

CALL gds.pageRank.write("citation_graph", {
  maxIterations: 20,
  writeProperty: "pagerank"
})
YIELD writeMillis, nodePropertiesWritten, ranIterations, postProcessingMillis, createMillis, computeMillis;
Table 8. Results
writeMillis nodePropertiesWritten ranIterations postProcessingMillis createMillis computeMillis

29

51956

20

0

0

112

And now let’s put the PageRank scores alongside the HITS Authority scores:

MATCH (a:Article)
RETURN a.title, a.year,
       round(a.pregel_auth, 3) AS auth,
       round(a.pagerank, 3) AS pagerank,
       size([(a)<-[:CITED]-(other) | other]) AS citations
ORDER BY auth DESC
LIMIT 10;
Table 9. Results
a.title a.year auth pagerank citations

"Rough sets"

1995

0.99

25.609

211

"Fuzzy Similarity Relation as a Basis for Rough Approximations"

1998

0.042

0.738

10

"Toward Intelligent Systems: Calculi of Information Granules"

2001

0.042

1.862

17

"Approximation spaces and information granulation"

2005

0.038

0.418

10

"Layered learning for concept synthesis"

2004

0.037

0.505

11

"A Comparison of Several Approaches to Missing Attribute Values in Data Mining"

2000

0.036

0.896

11

"Variable Consistency Model of Dominance-Based Rough Sets Approach"

2000

0.029

0.471

7

"RSES and RSESlib - A Collection of Tools for Rough Set Computations"

2000

0.026

1.296

12

"A New Version of Rough Set Exploration System"

2002

0.026

0.682

10

"Rough sets and information granulation"

2003

0.026

0.375

6

Rough Sets is the only one with a high PageRank score as well. In fact, its PageRank score is the 3rd highest in the graph, which we can see by running the following query:

MATCH (a:Article)
RETURN a.title, a.year,
       round(a.pregel_auth, 5) AS auth,
       round(a.pagerank, 5) AS pagerank,
       size([(a)<-[:CITED]-(other) | other]) AS citations
ORDER BY pagerank DESC
LIMIT 10;
Table 10. Results
a.title a.year auth pagerank citations

"A method for obtaining digital signatures and public-key cryptosystems"

1978

5.0E-5

93.94313

125

"Secure communications over insecure channels"

1978

0.0

79.86924

7

"Rough sets"

1995

0.9902

25.60911

211

"An axiomatic basis for computer programming"

1969

4.4E-4

23.02937

93

"Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems"

2001

0.0

21.46956

108

"SCRIBE: The Design of a Large-Scale Event Notification Infrastructure"

2001

0.0

19.4863

14

"A field study of the software design process for large systems"

1988

0.0

19.02815

53

"Productivity factors and programming environments"

1984

0.0

18.49935

5

"Analyzing medium-scale software development"

1978

0.0

16.45275

5

"A Calculus of Communicating Systems"

1982

0.0

15.43059

55

I find it kinda interesting that while these articles have very high transitive importance, their HITS Authority score is very low. Many of them have a lot of citations as well, but presumably most of those citations aren’t from hub nodes.

Analysing hub scores

Speaking of hubs, let’s explore those in a bit more detail. We can find the articles with the highest hub score, by running the following query:

MATCH (a:Article)
WITH a, [(a)-[:CITED]->(other) | other] AS cited
RETURN a.title, a.year,
       round(a.pregel_hub, 3) AS hub,
       round(apoc.coll.max([c in cited | c.pregel_auth]), 3) AS maxAuth,
       round(apoc.coll.min([c in cited | c.pregel_auth]), 3) AS minAuth,
       round(apoc.coll.avg([c in cited | c.pregel_auth]), 3) AS averageAuth,
       size(cited) AS cited
ORDER BY a.pregel_hub DESC
LIMIT 10;
Table 11. Results
a.title a.year hub maxAuth minAuth averageAuth cited

"Rough ethology: towards a biologically-inspired study of collective behavior in intelligent systems with approximation spaces"

2005

0.083

0.99

0.006

0.102

12

"Some Issues on Rough Sets"

2004

0.082

0.99

0.006

0.134

9

"A treatise on rough sets"

2005

0.079

0.99

0.005

0.145

8

"Approximate boolean reasoning: foundations and applications in data mining"

2006

0.079

0.99

0.005

0.115

10

"Multimodal classification: case studies"

2006

0.075

0.99

0.005

0.122

9

"The rough set exploration system"

2005

0.075

0.99

0.005

0.157

7

"On generalized rough fuzzy approximation operators"

2006

0.075

0.99

0.026

0.274

4

"Lattices with Interior and Closure Operators and Abstract Approximation Spaces"

2009

0.074

0.99

0.005

0.136

8

"Matching 2d image segments with genetic algorithms and approximation spaces"

2006

0.074

0.99

0.005

0.154

7

"Hybridization of rough sets and statistical learning theory"

2011

0.073

0.99

0.005

0.214

5

The maxAuth scores tell us that all of these articles cite the "Rough sets" article that we came across in the previous section. There aren’t really any other articles with a high authority score, so we can assume that nearly all of the hub score is coming from citing "Rough sets". In any case, let’s have a look at the other authorities that these articles have cited:

MATCH (a:Article)
WITH a, [(a)-[:CITED]->(other) | other] AS cited
WITH a, apoc.coll.sortNodes(cited, "pregel_auth")[..5] AS topAuthorities
RETURN a.title, a.year,
       round(a.pregel_hub, 3) AS hub,
       [c in topAuthorities | {article: c.title, score: round(c.pregel_auth, 3)}] AS topAuthorities
ORDER BY hub DESC
LIMIT 10;
Table 12. Results
a.title a.year hub topAuthorities

"Rough ethology: towards a biologically-inspired study of collective behavior in intelligent systems with approximation spaces"

2005

0.083

[{score: 0.99, article: "Rough sets"}, {score: 0.042, article: "Toward Intelligent Systems: Calculi of Information Granules"}, {score: 0.038, article: "Approximation spaces and information granulation"}, {score: 0.037, article: "Layered learning for concept synthesis"}, {score: 0.026, article: "Rough sets and information granulation"}]

"Some Issues on Rough Sets"

2004

0.082

[{score: 0.99, article: "Rough sets"}, {score: 0.042, article: "Toward Intelligent Systems: Calculi of Information Granules"}, {score: 0.042, article: "Fuzzy Similarity Relation as a Basis for Rough Approximations"}, {score: 0.038, article: "Approximation spaces and information granulation"}, {score: 0.036, article: "A Comparison of Several Approaches to Missing Attribute Values in Data Mining"}]

"A treatise on rough sets"

2005

0.079

[{score: 0.99, article: "Rough sets"}, {score: 0.042, article: "Fuzzy Similarity Relation as a Basis for Rough Approximations"}, {score: 0.038, article: "Approximation spaces and information granulation"}, {score: 0.037, article: "Layered learning for concept synthesis"}, {score: 0.026, article: "Rough sets and information granulation"}]

"Approximate boolean reasoning: foundations and applications in data mining"

2006

0.079

[{score: 0.99, article: "Rough sets"}, {score: 0.042, article: "Fuzzy Similarity Relation as a Basis for Rough Approximations"}, {score: 0.037, article: "Layered learning for concept synthesis"}, {score: 0.026, article: "RSES and RSESlib - A Collection of Tools for Rough Set Computations"}, {score: 0.021, article: "Some Issues on Rough Sets"}]

"Multimodal classification: case studies"

2006

0.075

[{score: 0.99, article: "Rough sets"}, {score: 0.037, article: "Layered learning for concept synthesis"}, {score: 0.026, article: "A New Version of Rough Set Exploration System"}, {score: 0.015, article: "The rough set exploration system"}, {score: 0.01, article: "Rough Set Methods in Approximation of Hierarchical Concepts"}]

"The rough set exploration system"

2005

0.075

[{score: 0.99, article: "Rough sets"}, {score: 0.036, article: "A Comparison of Several Approaches to Missing Attribute Values in Data Mining"}, {score: 0.021, article: "Rough Sets and Decision Algorithms"}, {score: 0.021, article: "In Pursuit of Patterns in Data Reasoning from Data The Rough Set Way"}, {score: 0.015, article: "Classification of Swallowing Sound Signals: A Rough Set Approach"}]

"On generalized rough fuzzy approximation operators"

2006

0.075

[{score: 0.99, article: "Rough sets"}, {score: 0.042, article: "Fuzzy Similarity Relation as a Basis for Rough Approximations"}, {score: 0.038, article: "Approximation spaces and information granulation"}, {score: 0.026, article: "Rough sets and information granulation"}]

"Lattices with Interior and Closure Operators and Abstract Approximation Spaces"

2009

0.074

[{score: 0.99, article: "Rough sets"}, {score: 0.042, article: "Fuzzy Similarity Relation as a Basis for Rough Approximations"}, {score: 0.024, article: "Approximation Operators in Qualitative Data Analysis"}, {score: 0.015, article: "Data with Missing Attribute Values: Generalization of Indiscernibility Relation and Rule Induction"}, {score: 0.005, article: "Algebraic structures for rough sets"}]

"Matching 2d image segments with genetic algorithms and approximation spaces"

2006

0.074

[{score: 0.99, article: "Rough sets"}, {score: 0.038, article: "Approximation spaces and information granulation"}, {score: 0.026, article: "Rough sets and information granulation"}, {score: 0.01, article: "K-means Indiscernibility Relation over Pixels"}, {score: 0.006, article: "Rough ethology: towards a biologically-inspired study of collective behavior in intelligent systems with approximation spaces"}]

"Hybridization of rough sets and statistical learning theory"

2011

0.073

[{score: 0.99, article: "Rough sets"}, {score: 0.038, article: "Approximation spaces and information granulation"}, {score: 0.026, article: "RSES and RSESlib - A Collection of Tools for Rough Set Computations"}, {score: 0.01, article: "Accuracy and Coverage in Rough Set Rule Induction"}, {score: 0.005, article: "Generalized indiscernibility relations: applications for missing values and analysis of structural objects"}]

The top 2 articles both cited "Toward Intelligent Systems: Calculi of Information Granules", which gives them a marginally higher score than the other 8. But I don’t think these hub scores are telling us all that much about these articles.

In Summary

While I’m not sure that this is the greatest data set to show off this algorithm, I think the algorithm itself is an interesting addition to the library. I’m curious to see how well it would fare on a Twitter graph - perhaps the HITS Hub score would help to identify those accounts that primarily tweet out links to interesting content? I guess that exploration will have to wait for another post!

  • LinkedIn
  • Tumblr
  • Reddit
  • Google+
  • Pinterest
  • Pocket