Cardinality estimation in parallel

Introduction 🧮

First, what is cardinality? It’s the number of unique elements in a field. In SQL, something like SELECT COUNT(DISTINCT field) FROM table. You can definitely count unique elements this way, but the trouble begins when you want to do quick analyses on big data. Exploring the relationship between two variables may involve multiple GROUP BYs and other operations for every pair of variables you want to look at, and this gets especially tedious and expensive if you want to explore every combination of fields. With n fields, checking every pair means on the order of O(n²) passes over the data, and it grows to roughly O(nᵐ) if you wanted to compare m columns at a time. With probabilistic data structures like HyperLogLog and MinHash, we can build a small sketch for every column in a single pass, so the cost stays around O(n). See the references listed at the bottom for in-depth discussions of probabilistic data structures, the class of data structures that HyperLogLog and MinHash belong to.
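To make the one-pass idea concrete, here is a minimal HyperLogLog sketch in Python. This is an illustration only, not the implementation the post uses; the class and parameter names are my own, and a real project would reach for a library such as datasketch. The merge method is what makes the estimate parallel-friendly: each worker sketches its shard, and unioning sketches is just an element-wise max.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch (illustrative, unoptimized)."""

    def __init__(self, p: int = 14):
        self.p = p                    # precision: m = 2^p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item) -> None:
        # 160-bit hash: the low p bits pick a register,
        # the remaining bits determine the rank (leading zeros + 1).
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        idx = h & (self.m - 1)
        w = h >> self.p
        rank = (160 - self.p) - w.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other: "HyperLogLog") -> None:
        # Union of two sketches is an element-wise max of registers,
        # so shards can be sketched independently and combined later.
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:   # small-range correction
            return self.m * math.log(self.m / zeros)
        return raw
```

With one sketch per column, any pair's union cardinality comes from merging two sketches rather than re-scanning the table, which is where the O(n²) pile of GROUP BYs disappears.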

Read More

3D RDF Graph Demo

While learning SPARQL (see previous post) I explored the tutorials of the Brick Schema to get a handle on querying an RDF database (using the RDFlib Python package). One small complaint about RDFlib is that there is no obvious way to output query results as a nice table. I learned, however, that the results can be serialized to a JSON string and from there easily turned into a dictionary or pandas DataFrame. Additionally, you can programmatically replace the prefixes:
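The JSON-to-table step can be sketched as follows. This is my own helper with made-up sample data, not code from the post: RDFlib can serialize a query result with `result.serialize(format="json")` (which may return bytes depending on the RDFlib version), producing the SPARQL 1.1 Query Results JSON Format parsed here with the standard library.

```python
import json

def bindings_to_rows(json_str):
    """Flatten a SPARQL JSON results string into a list of dicts,
    one per result row, ready to hand to pandas.DataFrame(rows)."""
    data = json.loads(json_str)
    variables = data["head"]["vars"]
    return [
        # Each binding maps a variable name to {"type": ..., "value": ...};
        # unbound variables are simply absent from that row.
        {var: binding[var]["value"] for var in variables if var in binding}
        for binding in data["results"]["bindings"]
    ]

# Hypothetical sample in the SPARQL JSON results shape:
sample = json.dumps({
    "head": {"vars": ["s", "label"]},
    "results": {"bindings": [
        {"s": {"type": "uri", "value": "http://example.org/a"},
         "label": {"type": "literal", "value": "A"}},
    ]},
})
rows = bindings_to_rows(sample)
```

From `rows`, `pandas.DataFrame(rows)` gives the nice table that RDFlib doesn't produce directly.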

Read More