Anytime Large-Scale Analytics of Linked Open Data
Analytical queries are queries with numerical aggregators: computing the average number of objects per property, identifying the most frequent subjects, etc. Such queries are essential to monitor the quality and the content of the Linked Open Data (LOD) cloud. Many analytical queries cannot be executed directly on the SPARQL endpoints, because the fair use policy cuts off expensive queries. This paper shows how to rewrite such queries into a set of queries that each satisfy the fair use policy.
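To make the notion of an analytical query concrete, here is a minimal Java sketch of one of the aggregates mentioned above, the average number of objects per property, computed over a tiny in-memory list of triples. The class `AnalyticalQuerySketch`, the nested `Triple` type and both method names are hypothetical and not part of the prototype. The sketch assumes that "objects per property" means the number of triples per property, and it does not show the query rewriting described in the paper.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not taken from the prototype's source code.
final class AnalyticalQuerySketch {

    /** A (subject, property, object) triple held in memory for illustration. */
    static final class Triple {
        final String subject;
        final String property;
        final String object;

        Triple(String subject, String property, String object) {
            this.subject = subject;
            this.property = property;
            this.object = object;
        }
    }

    /** Counts the triples of each property (assumption: one object per triple). */
    static Map<String, Integer> objectsPerProperty(List<Triple> triples) {
        Map<String, Integer> counts = new HashMap<>();
        for (Triple t : triples) {
            counts.merge(t.property, 1, Integer::sum);
        }
        return counts;
    }

    /** Averages the per-property counts; returns 0.0 for an empty input. */
    static double averageObjectsPerProperty(List<Triple> triples) {
        return objectsPerProperty(triples).values().stream()
                .mapToInt(Integer::intValue)
                .average()
                .orElse(0.0);
    }

    public static void main(String[] args) {
        List<Triple> triples = List.of(
                new Triple("a", "p", "x"),
                new Triple("a", "p", "y"),
                new Triple("b", "q", "z"));
        System.out.println(averageObjectsPerProperty(triples));
    }
}
```

On a real triplestore the same aggregate has to scan every triple, which is the kind of expensive query that a fair use policy tends to cut off.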
Publication
Anytime Large-Scale Analytics of Linked Open Data
Arnaud Soulet and Fabian M. Suchanek
Full paper at ISWC19 (research track)
Results of experiments
- Validation on DBpedia:
  - Quality of convergence: dbpedia_convergence.csv
  - Top-k precision: dbpedia_top.csv
  - Efficiency: dbpedia_efficiency.csv
- Usage query on the LOD cloud: LOD_statistics.csv
- Representativeness of LOD: LOD_representativeness.csv
Source code
We provide the Java source code of the prototype:
- Usage query: This implementation measures property usage and class usage. Source code
- Representativeness of LOD: This implementation computes, for each property, the distribution of the first significant digit of the number of objects per subject. We used the method proposed here to convert this distribution into a score between 0 and 1 that measures the "representativeness" of the triplestores. Source code
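The first-significant-digit distribution described for the representativeness implementation can be sketched as follows. This is a minimal illustration under stated assumptions, not the prototype's code: the class `FirstDigitSketch` and its methods are hypothetical, the input is assumed to be an array holding the number of objects of each subject for one property, and the conversion of the distribution into a score between 0 and 1 is not shown.

```java
// Hypothetical illustration; not taken from the prototype's source code.
final class FirstDigitSketch {

    /** Returns the leading decimal digit (1 to 9) of a positive count. */
    static int firstSignificantDigit(long n) {
        if (n <= 0) {
            throw new IllegalArgumentException("count must be positive: " + n);
        }
        while (n >= 10) {
            n /= 10;
        }
        return (int) n;
    }

    /**
     * Returns an array f of length 10 where f[d] is the relative frequency
     * of d as first significant digit (index 0 is unused and stays 0).
     */
    static double[] distribution(long[] objectsPerSubject) {
        double[] freq = new double[10];
        if (objectsPerSubject.length == 0) {
            return freq;
        }
        for (long count : objectsPerSubject) {
            freq[firstSignificantDigit(count)] += 1.0;
        }
        for (int d = 1; d <= 9; d++) {
            freq[d] /= objectsPerSubject.length;
        }
        return freq;
    }

    public static void main(String[] args) {
        // Assumed sample: number of objects per subject for a single property.
        long[] counts = {1, 12, 130, 2, 25, 9};
        double[] freq = distribution(counts);
        for (int d = 1; d <= 9; d++) {
            System.out.println(d + ": " + freq[d]);
        }
    }
}
```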
How to
- Install the database schema by running the SQL script in the sql directory (for MySQL 8)
- Configure the application by modifying the property files in properties/
  - robusta_file.properties: thread configuration (producer_number and consumer_number) and database configuration
  - file.properties: database configuration
- Run the application
  - TestXxx: tests the application against a single SPARQL endpoint (for development purposes)
  - LODXxxApplication: runs the application with a server and a client (for production purposes)
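As an illustration of the configuration step, the thread settings named for robusta_file.properties could be read with the standard `java.util.Properties` class roughly as follows. Only the key names `producer_number` and `consumer_number` come from the description above. The class `ThreadConfigSketch`, the sample values and the default are assumptions, and the prototype may load its property files differently.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical illustration; not taken from the prototype's source code.
final class ThreadConfigSketch {

    /** Parses property-file text (key=value lines) into a Properties object. */
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Reads an integer setting, falling back to a default when the key is absent. */
    static int readInt(Properties props, String key, int defaultValue) {
        String raw = props.getProperty(key);
        return raw == null ? defaultValue : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        // Sample values are assumptions, not recommended settings.
        Properties props = parse("producer_number=4\nconsumer_number=8\n");
        System.out.println("producers: " + readInt(props, "producer_number", 1));
        System.out.println("consumers: " + readInt(props, "consumer_number", 1));
    }
}
```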