Article, 2024

Efficient Skyline Keyword-Based Tree Retrieval on Attributed Graphs

IEEE Transactions on Knowledge and Data Engineering, ISSN 1041-4347, 1558-2191, Volume PP, 99, Pages 1-14, 10.1109/tkde.2024.3388988

Contributors

Wu, Dingming [1] Zhang, Zhaofen [1] Jensen, Christian Søndergaard 0000-0002-9697-7670 [2] Lu, Kezhong [1]

Affiliations

  1. [1] Shenzhen University
  2. [NORA names: China; Asia, East];
  3. [2] Aalborg University
  4. [NORA names: AAU Aalborg University; University; Denmark; Europe, EU; Nordic; OECD]

Abstract

Attributed graphs are graphs, where the vertices have attributes. Such graphs encompass, e.g., social network graph, citation graphs, and knowledge graphs, which have numerous real-world applications. Keyword-based search is a prominent and user-friendly way of querying attributed graphs. One widely used approach to keyword search adopts tree-based query semantics that relies on scoring functions that aggregate distances from a root to keyword-matched vertices. However, it is non-trivial to design scoring functions that capture different users’ keyword preferences. This study defines and solves the skyline KTree retrieval problem that combines keyword querying with skyline functionality on attributed graphs. The result of a skyline KTree query is independent of scoring functions. Hence, no matter which keywords are preferred, users can always find their favorite KTrees in a result. To enable efficient skyline KTree retrieval, we propose algorithm $\mathsf {FilterRefine}$ that first identifies candidate results and then uses them for search space pruning. Computing distances between keywords and vertices is expensive and dominates the computational cost of $\mathsf {FilterRefine}$. Inspired by subspace skyline query techniques, we convert the skyline KTree retrieval problem into a multi-dimensional subspace skyline problem and propose algorithm $\mathsf {MultiDiSkylineOpt}$. This algorithm is able to reuse skylines in subspaces and uses bounds on all dimensions to accelerate distance computation. Experimental results on real datasets show that a baseline algorithm cannot report results within a 500 second cut-off time, while the proposed algorithms are able to compute results in reasonable time. In particular, $\mathsf {MultiDiSkylineOpt}$ is able to efficiently retrieve skyline KTrees on large graphs with millions of nodes and hundreds of millions of edges.

Keywords

Keyword-based search, Ktree, NO, aggregate distance, aggregation, algorithm, applications, attribute graph, attributes, baseline, baseline algorithms, bounds, citation graph, citations, computational cost, computer, computing distances, cost, cut-off time, dataset, design scoring functions, dimensions, distance, distance computation, edge, experimental results, function, graph, keyword preferences, keywords, knowledge, knowledge graph, network graph, nodes, preferences, problem, query, query techniques, results, retrieval, retrieval problem, root, scores, scoring function, search, semantics, skyline, skyline problem, social network graph, study, subspace, technique, time, users, vertices

Funders

  • National Natural Science Foundation of China

Data Provider: Digital Science