<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
    <channel>
        <title>jarnaldich.me</title>
        <link>http://jarnaldich.me</link>
        <description><![CDATA[Joan Arnaldich's Blog]]></description>
        <atom:link href="http://jarnaldich.me/rss.xml" rel="self"
                   type="application/rss+xml" />
        <lastBuildDate>Sun, 19 Feb 2023 00:00:00 UT</lastBuildDate>
        <item>
    <title>Near Duplicates Detection</title>
    <link>http://jarnaldich.me/blog/2023/03/19/near-duplicates.html</link>
    <description><![CDATA[<h1>Near Duplicates Detection</h1>

<small>Posted on February 19, 2023 <a href="/blog/2023/03/19/near-duplicates.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>

<p>In my <a href="http://jarnaldich.me/blog/2023/01/29/jupyterlite-jsonp.html">previous post</a> I set up a tool to ease the download of open datasets into a JupyterLite environment, a neat way to perform simple data wrangling without any local installation.</p>
<p>In this post we will put that tool to good use for one of the most common data cleaning tasks: near duplicate detection.</p>
<figure>
<img src="/images/spiderman_double.png" title="spiderman double" class="center" alt="" /><figcaption> </figcaption>
</figure>
<h2 id="why-bother-about-near-duplicates">Why bother about near duplicates?</h2>
<p>Near duplicates can be a sign of a poor schema implementation, especially when they appear in variables with finite domains (factors). For example, in the following addresses dataset:</p>
<center>
<table>
<thead>
<tr class="header">
<th>kind</th>
<th>name</th>
<th>number</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>road</td>
<td>Abbey</td>
<td>3</td>
</tr>
<tr class="even">
<td>square</td>
<td>Level</td>
<td>666</td>
</tr>
<tr class="odd">
<td>drive</td>
<td>Mullholand</td>
<td>1</td>
</tr>
<tr class="even">
<td>boulevard</td>
<td>Broken Dreams</td>
<td>4</td>
</tr>
</tbody>
</table>
</center>
<p/>
<p>The “kind” variable could predictably take any of the following values:</p>
<ul>
<li>road</li>
<li>square</li>
<li>avenue</li>
<li>drive</li>
<li>boulevard</li>
</ul>
<p>The problem is that this kind of data is too often modelled as an unconstrained string, which makes it error prone: ‘sqare’ is just as valid as ‘square’. This generates all kinds of problems down the data analysis pipeline: what would happen if we analyzed the frequency of each kind?</p>
<p>There are ways to ensure that the variable “kind” can only take one of those values, depending on the underlying data infrastructure:</p>
<ul>
<li>In relational databases one could use <a href="https://www.postgresql.org/docs/current/sql-createdomain.html">domain types</a>, data validation <a href="https://www.postgresql.org/docs/current/sql-createtrigger.html">triggers</a>, or plain old dictionary tables with 1:n relationships.</li>
<li>Non-relational DBs may have other ways to ensure schema conformance, e.g. through <a href="https://www.mongodb.com/docs/manual/core/schema-validation/specify-json-schema/">JSON schema</a> or <a href="http://exist-db.org/exist/apps/doc/validation">XML schema</a>.</li>
<li>The fallback option is to guarantee this “by construction” via application validation (e.g. using drop-downs in the UI), although this is a weaker solution since it incurs unnecessary coupling… and things can go sideways anyway, so in this scenario you should consider performing periodic schema validation tests on the data.</li>
</ul>
<p>Notice that all of these solutions require <em>a priori</em> knowledge of the domain.</p>
<p>But what happens when we are faced with an (underdocumented) dataset and asked to use it as a source for analysis? Or when we are asked to derive these rules <em>a posteriori</em>, e.g. to improve a legacy database? Well, without knowledge of the domain, it is just not possible to decide whether two similar values are both correct (and just happen to be spelled similarly) or one is a misspelling of the other. The best we can do is detect which values are indeed similar and raise a flag.</p>
<p>This is when the techniques explained in this blog post come in handy.</p>
<h2 id="the-algorithm">The algorithm</h2>
<p>For the sake of simplicity, in this blog post we will assume our data is small enough that a quadratic algorithm is acceptable (for the real thing, see the references at the end). Beware that, on modern hardware, this simple case can take you farther than you would initially expect. My advice is to always <em>use the simplest solution that gets the job done</em>. It usually pays off in both development time and incidental complexity (reliance on external dependencies, etc…).</p>
<p>There are two main metrics regarding similarity. The first one, restricted to strings, is the <a href="https://en.wikipedia.org/wiki/Levenshtein_distance">Levenshtein</a> (aka edit) distance, which is the minimum number of single-character edits (insertions, deletions, substitutions) needed to go from one string to another. This metric is hard to scale in general, since it requires pairwise comparison.</p>
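<p>For intuition, here is a minimal pure-Python sketch of the edit distance using the classic dynamic programming formulation (the code later in this post will simply call <code>nltk.edit_distance</code> instead):</p>

```python
def edit_distance(a: str, b: str) -> int:
    # prev[j] holds the distance between the processed prefix of `a`
    # and b[:j]; we sweep one row per character of `a`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]

edit_distance("square", "sqare")  # → 1 (one dropped character)
```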
<p>The other one is both more general and more scalable. It involves generating n-gram sets and then comparing them using a set-similarity measure.</p>
<h3 id="n-gram-sets">N-gram sets</h3>
<p>For each string, we can associate a set of n-grams that can be derived from it. N-grams (sometimes called <em>shingles</em>) are just substrings of length n. A typical case is <code>n=3</code>, which generates what is known as trigrams. For example, the trigram set for the string <code>"algorithm"</code> would be <code>['alg', 'lgo', 'gor', 'ori', 'rit', 'ith', 'thm']</code>.</p>
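<p>Building the trigram set is a one-liner in plain Python; this sketch is equivalent to the <code>nltk.ngrams</code>-based expression used in the code further down:</p>

```python
def ngram_set(s: str, n: int = 3) -> set:
    # All contiguous substrings of length n
    return {s[i:i + n] for i in range(len(s) - n + 1)}

ngram_set("algorithm")
# → {'alg', 'lgo', 'gor', 'ori', 'rit', 'ith', 'thm'}
```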
<h3 id="jaccard-index">Jaccard Index</h3>
<p>Once we have the n-gram set for a string, we can use a general metric for set similarity. A popular one is the <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard Index</a>, which is defined as the ratio of the cardinality of the intersection to the cardinality of the union of two sets:</p>
<p><span class="math display">\[J(A,B) = \frac{|A \bigcap B|}{|A \bigcup B|}\]</span></p>
<p>Note that this index will range from 0, for disjoint sets, to 1, for exactly equal sets.</p>
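<p>In code, the index is a direct translation of the formula. A small self-contained sketch:</p>

```python
def ngram_set(s: str, n: int = 3) -> set:
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a: set, b: set) -> float:
    # |A ∩ B| / |A ∪ B|; undefined (NaN) when both sets are empty
    union = a | b
    return len(a & b) / len(union) if union else float("nan")

jaccard(ngram_set("square"), ngram_set("sqare"))  # → 1/6
```

The trigram sets of <code>"square"</code> and <code>"sqare"</code> share only <code>'are'</code> out of six distinct trigrams, hence 1/6.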
<h3 id="if-we-were-to-scale">If we were to scale…</h3>
<p>The advantage of using n-gram sets is that we can build similarity-preserving summaries of those sets (e.g. via <a href="https://en.wikipedia.org/wiki/MinHash">minhashing</a>), which, combined with <a href="https://en.wikipedia.org/wiki/Locality-sensitive_hashing">locality sensitive hashing</a> to avoid comparing every pair of sets, provides a massively scalable solution. In this post we will just assume that the size of our data is small enough that we do not need to scale.</p>
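<p>To get a taste of the idea (a toy sketch, not the full LSH pipeline from the references): the probability that two sets share the same minimum under a random hash function equals their Jaccard index, so the fraction of matching positions in two fixed-length signatures estimates it.</p>

```python
import random

def minhash_signature(items: set, num_hashes: int = 128, seed: int = 0) -> list:
    # One salted hash per simulated permutation; keep the minimum over the set
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, it)) for it in items) for salt in salts]

def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    # Fraction of positions whose minima coincide estimates the Jaccard index
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = {"alg", "lgo", "gor", "ori"}
b = {"alg", "lgo", "gor", "rit"}
estimate_jaccard(minhash_signature(a), minhash_signature(b))  # roughly J(a,b) = 3/5
```

The payoff is that signatures have fixed size regardless of set size, and LSH can then bucket similar signatures together so that only likely matches are compared.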
<h2 id="the-code">The Code</h2>
<p>All of the above can be implemented in the following utility function, which takes an iterable of strings along with the minimum Jaccard similarity and maximum Levenshtein distance for a pair to be flagged as a near-duplicate candidate. It returns a pandas dataframe with the pair indices, their values, their Jaccard similarity and their Levenshtein distance. We will use the <a href="https://www.nltk.org/">Natural Language Toolkit</a> for the implementation of those metrics.</p>
<p>Bear in mind that, in a real use case, we would very likely apply some normalization before testing for near duplicates (e.g. to account for extra whitespace and/or differences in upper/lower case).</p>
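<p>For instance, a simple normalizer (a hypothetical helper, adapt it to your data) that trims, collapses whitespace and lowercases before comparison:</p>

```python
def normalize(s: str) -> str:
    # str.split() with no arguments splits on any run of whitespace,
    # so joining with a single space collapses internal runs too.
    return " ".join(s.split()).lower()

normalize("  Gran   Via ")  # → 'gran via'
```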
<div class="sourceCode" id="cb1"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> nltk
<span class="im">import</span> numpy <span class="im">as</span> np
<span class="im">import</span> pandas <span class="im">as</span> pd

<span id="cb1-1"><a href="#cb1-1"></a><span class="kw">def</span> near_duplicates(factors, min_jaccard: <span class="bu">float</span>, max_levenshtein: <span class="bu">int</span>):</span>
<span id="cb1-2"><a href="#cb1-2"></a>  trigrams <span class="op">=</span> [ <span class="bu">set</span>(<span class="st">&#39;&#39;</span>.join(g) <span class="cf">for</span> g <span class="kw">in</span> nltk.ngrams(f, <span class="dv">3</span>)) <span class="cf">for</span> f <span class="kw">in</span> factors ]</span>
<span id="cb1-3"><a href="#cb1-3"></a>  jaccard <span class="op">=</span> <span class="bu">dict</span>()</span>
<span id="cb1-4"><a href="#cb1-4"></a>  levenshtein <span class="op">=</span> <span class="bu">dict</span>()</span>
<span id="cb1-5"><a href="#cb1-5"></a>  <span class="cf">for</span> i <span class="kw">in</span> <span class="bu">range</span>(<span class="bu">len</span>(factors)):</span>
<span id="cb1-6"><a href="#cb1-6"></a>    <span class="cf">for</span> j <span class="kw">in</span> <span class="bu">range</span>(i<span class="op">+</span><span class="dv">1</span>, <span class="bu">len</span>(factors)):</span>
<span id="cb1-7"><a href="#cb1-7"></a>      denom <span class="op">=</span> <span class="bu">float</span>(<span class="bu">len</span>(trigrams[i] <span class="op">|</span> trigrams[j]))</span>
<span id="cb1-8"><a href="#cb1-8"></a>      <span class="cf">if</span> denom <span class="op">&gt;</span> <span class="dv">0</span>:</span>
<span id="cb1-9"><a href="#cb1-9"></a>        jaccard[(i,j)] <span class="op">=</span> <span class="bu">float</span>(<span class="bu">len</span>(trigrams[i] <span class="op">&amp;</span> trigrams[j])) <span class="op">/</span> denom</span>
<span id="cb1-10"><a href="#cb1-10"></a>      <span class="cf">else</span>:</span>
<span id="cb1-11"><a href="#cb1-11"></a>        jaccard[(i,j)] <span class="op">=</span> np.nan</span>
<span id="cb1-12"><a href="#cb1-12"></a>      levenshtein[(i,j)] <span class="op">=</span> nltk.edit_distance(factors[i], factors[j])</span>
<span id="cb1-13"><a href="#cb1-13"></a></span>
<span id="cb1-14"><a href="#cb1-14"></a>  acum <span class="op">=</span> []</span>
<span id="cb1-15"><a href="#cb1-15"></a>  <span class="cf">for</span> (i,j),v <span class="kw">in</span> jaccard.items():</span>
<span id="cb1-16"><a href="#cb1-16"></a>    <span class="cf">if</span> v <span class="op">&gt;=</span> min_jaccard <span class="kw">and</span> levenshtein[(i,j)] <span class="op">&lt;=</span> max_levenshtein: </span>
<span id="cb1-17"><a href="#cb1-17"></a>      acum.append([i,j,factors[i], factors[j], jaccard[(i,j)], levenshtein[(i,j)]])</span>
<span id="cb1-18"><a href="#cb1-18"></a></span>
<span id="cb1-19"><a href="#cb1-19"></a>  <span class="cf">return</span> pd.DataFrame(acum, columns<span class="op">=</span>[<span class="st">&#39;i&#39;</span>, <span class="st">&#39;j&#39;</span>, <span class="st">&#39;factor_i&#39;</span>, <span class="st">&#39;factor_j&#39;</span>, <span class="st">&#39;jaccard_ij&#39;</span>, <span class="st">&#39;levenshtein_ij&#39;</span>])</span></code></pre></div>
<p>We can extend the above functions to explore a set of columns in a pandas data frame with the following code:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode python"><code class="sourceCode python"><span class="im">import</span> numbers
<span class="im">from</span> collections <span class="im">import</span> defaultdict
<span class="im">from</span> pandas.api.types <span class="im">import</span> is_string_dtype

<span id="cb2-1"><a href="#cb2-1"></a><span class="kw">def</span> df_dups(df, cols<span class="op">=</span><span class="va">None</span>, except_cols<span class="op">=</span>[], min_jaccard<span class="op">=</span><span class="fl">0.3</span>, max_levenshtein<span class="op">=</span><span class="dv">4</span>):</span>
<span id="cb2-2"><a href="#cb2-2"></a>  acum <span class="op">=</span> []</span>
<span id="cb2-3"><a href="#cb2-3"></a>  </span>
<span id="cb2-4"><a href="#cb2-4"></a>  <span class="cf">if</span> cols <span class="kw">is</span> <span class="va">None</span>:</span>
<span id="cb2-5"><a href="#cb2-5"></a>    cols <span class="op">=</span> df.columns</span>
<span id="cb2-6"><a href="#cb2-6"></a></span>
<span id="cb2-7"><a href="#cb2-7"></a>  <span class="cf">if</span> <span class="bu">isinstance</span>(min_jaccard, numbers.Number):</span>
<span id="cb2-8"><a href="#cb2-8"></a>    mj <span class="op">=</span> defaultdict(<span class="kw">lambda</span> : min_jaccard)</span>
<span id="cb2-9"><a href="#cb2-9"></a>  <span class="cf">else</span>:</span>
<span id="cb2-10"><a href="#cb2-10"></a>    mj <span class="op">=</span> min_jaccard</span>
<span id="cb2-11"><a href="#cb2-11"></a></span>
<span id="cb2-12"><a href="#cb2-12"></a>  <span class="cf">if</span> <span class="bu">isinstance</span>(max_levenshtein, numbers.Number):</span>
<span id="cb2-13"><a href="#cb2-13"></a>    ml <span class="op">=</span> defaultdict(<span class="kw">lambda</span>: max_levenshtein)</span>
<span id="cb2-14"><a href="#cb2-14"></a>  <span class="cf">else</span>:</span>
<span id="cb2-15"><a href="#cb2-15"></a>    ml <span class="op">=</span> max_levenshtein</span>
<span id="cb2-16"><a href="#cb2-16"></a></span>
<span id="cb2-17"><a href="#cb2-17"></a>  <span class="cf">for</span> c <span class="kw">in</span> cols:</span>
<span id="cb2-18"><a href="#cb2-18"></a></span>
<span id="cb2-19"><a href="#cb2-19"></a>    <span class="cf">if</span> c <span class="kw">in</span> except_cols <span class="kw">or</span> <span class="kw">not</span> is_string_dtype(df[c]):</span>
<span id="cb2-20"><a href="#cb2-20"></a>      <span class="cf">continue</span></span>
<span id="cb2-21"><a href="#cb2-21"></a></span>
<span id="cb2-22"><a href="#cb2-22"></a>    factors <span class="op">=</span> df[c].factorize()[<span class="dv">1</span>]</span>
<span id="cb2-23"><a href="#cb2-23"></a>    col_dups <span class="op">=</span> near_duplicates(factors, mj[c], ml[c])</span>
<span id="cb2-24"><a href="#cb2-24"></a>    col_dups[<span class="st">&#39;col&#39;</span>] <span class="op">=</span> c</span>
<span id="cb2-25"><a href="#cb2-25"></a>    acum.append(col_dups)</span>
<span id="cb2-26"><a href="#cb2-26"></a></span>
<span id="cb2-27"><a href="#cb2-27"></a>  <span class="cf">return</span> pd.concat(acum)</span></code></pre></div>
<p>If we apply the above code to the open dataset from the <a href="http://jarnaldich.me/blog/2023/01/29/jupyterlite-jsonp.html">last blog post</a>:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb3-1"><a href="#cb3-1"></a>df_dups(df, cols<span class="op">=</span>[<span class="st">&#39;Proveïdor&#39;</span>,</span>
<span id="cb3-2"><a href="#cb3-2"></a>       <span class="st">&#39;Objecte del contracte&#39;</span>, </span>
<span id="cb3-3"><a href="#cb3-3"></a>       <span class="st">&#39;Tipus Contracte&#39;</span>])</span></code></pre></div>
<p>The column names are in Catalan, since the dataset comes from the <a href="https://opendata-ajuntament.barcelona.cat/">Barcelona Council Open Data Hub</a>; they stand for the <em>contractor</em>, the <em>service description</em>, and the <em>type of service</em>.</p>
<p>We get the following results:</p>
<figure>
<img src="/images/near_dups_menors.png" title="near duplicates results" class="center" width="850" alt="" /><figcaption> </figcaption>
</figure>
<p>Notice that the first two are actually valid, despite being similar (two companies with similar names, and <em>electric</em> vs <em>electronic</em> supplies), while the last two seem to be a case of not controlling the variable domain properly (singular/plural entries). We should definitely settle on a canonical value (singular or plural) for the column “Tipus Contracte” before computing any aggregation on it.</p>
<h2 id="conclusions">Conclusions</h2>
<p>We can use the above functions as helpers before performing analysis on datasets where domain rules have not been enforced. They are compatible with JupyterLite, so there is no need to install anything to try them out. For convenience, you can find a working notebook <a href="https://gist.github.com/jarnaldich/24ece34b6fb441c3ef8878a39a265b82">in this gist</a>.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="http://www.mmds.org/">Mining Of Massive Datasets</a> - An absolute classic book. Chapter 3, in particular, describes a scalable improvement on the technique described in this blog post.</li>
</ul>

<div class="panel panel-default">
    <div class="panel-body">
        <div class="pull-left">
            Tags: <a href="/tags/jupyterlite.html">jupyterlite</a>, <a href="/tags/data.html">data</a>, <a href="/tags/nltk.html">nltk</a>, <a href="/tags/jaccard.html">jaccard</a>, <a href="/tags/qc.html">qc</a>
        </div>
        <div class="social pull-right">
            <span class="twitter">
                <a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2023/03/19/near-duplicates.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
            </span>

            
        </div>
    </div>
</div>

<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>

<div id="disqus_thread"></div>  
<script type="text/javascript">
      var disqus_shortname = 'jarnaldich';
      (function() {
          var dsq = document.createElement('script');
          dsq.type = 'text/javascript';
          dsq.async = true;
          dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
          (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
      })();
</script>
<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
    
]]></description>
    <pubDate>Sun, 19 Feb 2023 00:00:00 UT</pubDate>
    <guid>http://jarnaldich.me/blog/2023/03/19/near-duplicates.html</guid>
    <dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
    <title>Dealing with CORS in JupyterLite</title>
    <link>http://jarnaldich.me/blog/2023/01/29/jupyterlite-jsonp.html</link>
    <description><![CDATA[<h1>Dealing with CORS in JupyterLite</h1>

<small>Posted on January 29, 2023 <a href="/blog/2023/01/29/jupyterlite-jsonp.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>

<p>Following my <a href="blog/2022/12/08/data-manipulation-jupyterlite.html">previous post</a>, I intend to see how far I can push JupyterLite as a platform for data analysis in the browser. The convenience of having a full environment with a sensible default set of libraries for dealing with data <a href="https://jupyterlite.github.io/demo/lab/index.html">one link away</a> is really something I could use.</p>
<p>But of course, for data analysis you need… well… data. There is certainly no shortage of public datasets on the internet, many of them falling into some sort of Open Data initiatives, such as the <a href="https://data.europa.eu/en/publications/open-data-maturity/2022">EU Open Data</a>.</p>
<p>But, as soon as you try to use JupyterLite to directly fetch data from those sites, you find yourself running into a wall named the <a href="https://portswigger.net/web-security/cors/same-origin-policy">Same Origin Policy</a>.</p>
<h2 id="same-origin-policy">Same Origin Policy</h2>
<p>The Same Origin Policy is a protection mechanism designed to guarantee that resource providers (hosts) can restrict usage of their data to the pages they host. This is the safe thing to do when there is user data involved, since it prevents third parties from gaining access to e.g. the user’s cookies and session IDs.</p>
<p>Notice that, when there is no user data involved, it is perfectly safe to relax this policy. In fact, as we will see, it is desirable to do so.</p>
<p>Browsers implement this protection by not allowing a page to perform requests to a server that is different from where it was downloaded unless this other server explicitly allows for it.</p>
<p>This behaviour bites hard at any application involving third-party data analysis in the browser, as well as at many webassembly “ports” of existing applications with networking capabilities, since the original desktop apps were not designed to deal with this kind of restriction<a href="#fn1" class="footnote-ref" id="fnref1" role="doc-noteref"><sup>1</sup></a> in the first place.</p>
<figure>
<img src="/images/cors.png" title="CORS" class="center" alt="" /><figcaption> </figcaption>
</figure>
<p>For example, if you are using the JupyterLite at <code>jupyterlite.github.io</code>, you will not be able to fetch from any server beyond <code>github.io</code> that does not specifically allow it… which many data providers don’t. The request will be blocked by the browser itself (step 2 in the diagram above). You will either need to download the data yourself and upload it to JupyterLite, or self-host JupyterLite and the data on your own server (using it as a proxy for data requests), which kinda takes all the convenience out of it. As an example, evaluating this snippet in JupyterLite works exactly as you would expect:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb1-1"><a href="#cb1-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
<span id="cb1-2"><a href="#cb1-2"></a><span class="im">from</span> js <span class="im">import</span> fetch</span>
<span id="cb1-3"><a href="#cb1-3"></a></span>
<span id="cb1-4"><a href="#cb1-4"></a>WORKS <span class="op">=</span> <span class="st">&quot;https://raw.githubusercontent.com/jupyterlite/jupyterlite/main/examples/data/iris.csv&quot;</span></span>
<span id="cb1-5"><a href="#cb1-5"></a>WORKS_CORS_ENABLED  <span class="op">=</span> <span class="st">&quot;https://data.wa.gov/api/views/f6w7-q2d2/rows.csv?accessType=DOWNLOAD&quot;</span></span>
<span id="cb1-6"><a href="#cb1-6"></a>FAILS_CORS_DISABLED <span class="op">=</span> <span class="st">&quot;https://opendata-ajuntament.barcelona.cat/data/dataset/1121f3e2-bfb1-4dc4-9f39-1c5d1d72cba1/resource/69ae574f-adfc-4660-8f81-73103de169ff/download/2018_menors.csv&quot;</span></span>
<span id="cb1-7"><a href="#cb1-7"></a></span>
<span id="cb1-8"><a href="#cb1-8"></a>res <span class="op">=</span> <span class="cf">await</span> fetch(WORKS)</span>
<span id="cb1-9"><a href="#cb1-9"></a>text <span class="op">=</span> <span class="cf">await</span> res.text()</span>
<span id="cb1-10"><a href="#cb1-10"></a><span class="bu">print</span>(text)</span></code></pre></div>
<p>There are two ways in which a data provider can accept cross-origin requests. The main one (the canonical, modern one) is known as <em>Cross Origin Resource Sharing</em> (CORS). By adding explicit permission in some dedicated HTTP headers, a resource provider can control <em>who</em> can access their data (the world or selected domains) and <em>how</em> (which HTTP methods).</p>
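<p>As an illustration (a hypothetical sketch, not part of any portal's actual code), a provider serving data with Python's standard <code>http.server</code> could opt in by adding the relevant header:</p>

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class CORSHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b'{"hello": "data"}'
        self.send_response(200)
        # This header is what opts the resource in to cross-origin reads:
        # "*" allows any origin; list specific domains to restrict access.
        self.send_header("Access-Control-Allow-Origin", "*")
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

# To try it locally (blocks until interrupted):
# HTTPServer(("", 8000), CORSHandler).serve_forever()
```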
<p>Whenever this is not possible or practical (it needs access to the HTTP server configuration, and some hosting providers may not allow it), there is a second way: the JSONP callback.</p>
<h2 id="the-jsonp-callback">The JSONP Callback</h2>
<p>The JSONP callback works along these lines:</p>
<ol type="1">
<li>The calling page (eg. JupyterLite) defines a callback function, with a data parameter.</li>
<li>The calling page (JupyterLite) loads a script from the data provider, passing the name of the callback function.</li>
<li>The data provider script calls back the function with the requested data.</li>
</ol>
<p>Since the script was downloaded from the data provider’s domain, it can perform requests to that domain, so CORS restrictions do not apply.</p>
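<p>On the provider side, the response body is just the JSON payload wrapped in a call to the client-supplied function name, so what the browser downloads is executable JavaScript rather than plain JSON. A minimal sketch of what such an endpoint returns (the real portal logic is of course more involved):</p>

```python
import json

def jsonp_response(data, callback: str) -> str:
    # Wrap the serialized payload in a call to the callback named by the
    # client in the query string (e.g. ?callback=window.corsCallBack)
    return f"{callback}({json.dumps(data)});"

jsonp_response({"rows": [1, 2]}, "window.corsCallBack")
# → 'window.corsCallBack({"rows": [1, 2]});'
```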
<p>This is not the recommended solution because it delegates to the application something that belongs to another layer: both the server and the consuming webpage have to be modified. One typical use case is making older browsers work. The other is kind of accidental: downloading from (poorly configured?) Open Data portals. Most Open Data portals (including administrative ones) use pre-built data management systems such as <a href="https://ckan.org">CKAN</a>. These can often handle JSONP by default, while HTTP servers have CORS disabled by default. So keeping the defaults leaves you with JSONP.</p>
<h2 id="implementing-a-jsonp-helper-in-jupyterlite">Implementing a JSONP helper in JupyterLite</h2>
<p>One of the things I love about the browser as a platform is that it is… pretty hackable… just press F12 and you can enter the kitchen. For example, you can see how JupyterLite “fakes” its filesystem on top of <a href="https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API">IndexedDB</a>, which is an API for storing persistent data in the browser.</p>
<p>So, we have a way to perform CORS requests and get data from a server implementing JSONP, and we can also fiddle with JupyterLite’s virtual filesystem… would it be possible to write a helper to download datasets into the virtual filesystem? You bet! Just paste the following code in a javascript kernel cell, or use the <code>%%javascript</code> magic in a python one:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode javascript"><code class="sourceCode javascript"><span id="cb2-1"><a href="#cb2-1"></a><span class="va">window</span>.<span class="at">saveJSONP</span> <span class="op">=</span> <span class="kw">async</span> (urlString<span class="op">,</span> file_path<span class="op">,</span> mime_type<span class="op">=</span><span class="st">&#39;text/json&#39;</span><span class="op">,</span> binary<span class="op">=</span><span class="kw">false</span>) <span class="kw">=&gt;</span> <span class="op">{</span></span>
<span id="cb2-2"><a href="#cb2-2"></a>    <span class="kw">const</span> sc <span class="op">=</span> <span class="va">document</span>.<span class="at">createElement</span>(<span class="st">&#39;script&#39;</span>)<span class="op">;</span></span>
<span id="cb2-3"><a href="#cb2-3"></a>    <span class="kw">var</span> url <span class="op">=</span> <span class="kw">new</span> <span class="at">URL</span>(urlString)<span class="op">;</span></span>
<span id="cb2-4"><a href="#cb2-4"></a>    <span class="va">url</span>.<span class="va">searchParams</span>.<span class="at">append</span>(<span class="st">&#39;callback&#39;</span><span class="op">,</span> <span class="st">&#39;window.corsCallBack&#39;</span>)<span class="op">;</span></span>
<span id="cb2-5"><a href="#cb2-5"></a>    </span>
<span id="cb2-6"><a href="#cb2-6"></a>    <span class="va">sc</span>.<span class="at">src</span> <span class="op">=</span> <span class="va">url</span>.<span class="at">toString</span>()<span class="op">;</span></span>
<span id="cb2-7"><a href="#cb2-7"></a></span>
<span id="cb2-8"><a href="#cb2-8"></a>    <span class="va">window</span>.<span class="at">corsCallBack</span> <span class="op">=</span> <span class="kw">async</span> (data) <span class="kw">=&gt;</span> <span class="op">{</span></span>
<span id="cb2-9"><a href="#cb2-9"></a>        <span class="va">console</span>.<span class="at">log</span>(data)<span class="op">;</span></span>
<span id="cb2-10"><a href="#cb2-10"></a></span>
<span id="cb2-11"><a href="#cb2-11"></a>        <span class="co">// Open (or create) the file storage</span></span>
<span id="cb2-12"><a href="#cb2-12"></a>        <span class="kw">var</span> open <span class="op">=</span> <span class="va">indexedDB</span>.<span class="at">open</span>(<span class="st">&#39;JupyterLite Storage&#39;</span>)<span class="op">;</span></span>
<span id="cb2-13"><a href="#cb2-13"></a></span>
<span id="cb2-14"><a href="#cb2-14"></a>        <span class="co">// Create the schema</span></span>
<span id="cb2-15"><a href="#cb2-15"></a>        <span class="va">open</span>.<span class="at">onupgradeneeded</span> <span class="op">=</span> <span class="kw">function</span>() <span class="op">{</span></span>
<span id="cb2-16"><a href="#cb2-16"></a>            <span class="cf">throw</span> <span class="at">Error</span>(<span class="st">&#39;Error opening IndexedDB. Should not ever need to upgrade JupyterLite Storage Schema&#39;</span>)<span class="op">;</span></span>
<span id="cb2-17"><a href="#cb2-17"></a>        <span class="op">};</span></span>
<span id="cb2-18"><a href="#cb2-18"></a></span>
<span id="cb2-19"><a href="#cb2-19"></a>        <span class="va">open</span>.<span class="at">onsuccess</span> <span class="op">=</span> <span class="kw">function</span>() <span class="op">{</span></span>
<span id="cb2-20"><a href="#cb2-20"></a>            <span class="co">// Start a new transaction</span></span>
<span id="cb2-21"><a href="#cb2-21"></a>            <span class="kw">var</span> db <span class="op">=</span> <span class="va">open</span>.<span class="at">result</span><span class="op">;</span></span>
<span id="cb2-22"><a href="#cb2-22"></a>            <span class="kw">var</span> tx <span class="op">=</span> <span class="va">db</span>.<span class="at">transaction</span>(<span class="st">&quot;files&quot;</span><span class="op">,</span> <span class="st">&quot;readwrite&quot;</span>)<span class="op">;</span></span>
<span id="cb2-23"><a href="#cb2-23"></a>            <span class="kw">var</span> store <span class="op">=</span> <span class="va">tx</span>.<span class="at">objectStore</span>(<span class="st">&quot;files&quot;</span>)<span class="op">;</span></span>
<span id="cb2-24"><a href="#cb2-24"></a></span>
<span id="cb2-25"><a href="#cb2-25"></a>            <span class="kw">var</span> now <span class="op">=</span> <span class="kw">new</span> <span class="at">Date</span>()<span class="op">;</span></span>
<span id="cb2-26"><a href="#cb2-26"></a></span>
<span id="cb2-27"><a href="#cb2-27"></a>            <span class="kw">var</span> value <span class="op">=</span> <span class="op">{</span></span>
<span id="cb2-28"><a href="#cb2-28"></a>                <span class="st">&#39;name&#39;</span><span class="op">:</span> <span class="va">file_path</span>.<span class="at">split</span>(<span class="ss">/</span><span class="sc">[\\/]</span><span class="ss">/</span>).<span class="at">pop</span>()<span class="op">,</span></span>
<span id="cb2-29"><a href="#cb2-29"></a>                <span class="st">&#39;path&#39;</span><span class="op">:</span> file_path<span class="op">,</span></span>
<span id="cb2-30"><a href="#cb2-30"></a>                <span class="st">&#39;format&#39;</span><span class="op">:</span> binary <span class="op">?</span> <span class="st">&#39;binary&#39;</span> : <span class="st">&#39;text&#39;</span><span class="op">,</span></span>
<span id="cb2-31"><a href="#cb2-31"></a>                <span class="st">&#39;created&#39;</span><span class="op">:</span> <span class="va">now</span>.<span class="at">toISOString</span>()<span class="op">,</span></span>
<span id="cb2-32"><a href="#cb2-32"></a>                <span class="st">&#39;last_modified&#39;</span><span class="op">:</span> <span class="va">now</span>.<span class="at">toISOString</span>()<span class="op">,</span></span>
<span id="cb2-33"><a href="#cb2-33"></a>                <span class="st">&#39;content&#39;</span><span class="op">:</span> <span class="va">JSON</span>.<span class="at">stringify</span>(data)<span class="op">,</span></span>
<span id="cb2-34"><a href="#cb2-34"></a>                <span class="st">&#39;mimetype&#39;</span><span class="op">:</span> mime_type<span class="op">,</span></span>
<span id="cb2-35"><a href="#cb2-35"></a>                <span class="st">&#39;type&#39;</span><span class="op">:</span> <span class="st">&#39;file&#39;</span><span class="op">,</span></span>
<span id="cb2-36"><a href="#cb2-36"></a>                <span class="st">&#39;writable&#39;</span><span class="op">:</span> <span class="kw">true</span></span>
<span id="cb2-37"><a href="#cb2-37"></a>            <span class="op">};</span>      </span>
<span id="cb2-38"><a href="#cb2-38"></a></span>
<span id="cb2-39"><a href="#cb2-39"></a>            <span class="kw">const</span> countRequest <span class="op">=</span> <span class="va">store</span>.<span class="at">count</span>(file_path)<span class="op">;</span></span>
<span id="cb2-40"><a href="#cb2-40"></a>            <span class="va">countRequest</span>.<span class="at">onsuccess</span> <span class="op">=</span> () <span class="kw">=&gt;</span> <span class="op">{</span></span>
<span id="cb2-41"><a href="#cb2-41"></a>              <span class="va">console</span>.<span class="at">log</span>(<span class="va">countRequest</span>.<span class="at">result</span>)<span class="op">;</span></span>
<span id="cb2-42"><a href="#cb2-42"></a>                <span class="cf">if</span>(<span class="va">countRequest</span>.<span class="at">result</span> <span class="op">&gt;</span> <span class="dv">0</span>) <span class="op">{</span></span>
<span id="cb2-43"><a href="#cb2-43"></a>                    <span class="va">store</span>.<span class="at">put</span>(value<span class="op">,</span> file_path)<span class="op">;</span></span>
<span id="cb2-44"><a href="#cb2-44"></a>                <span class="op">}</span> <span class="cf">else</span> <span class="op">{</span></span>
<span id="cb2-45"><a href="#cb2-45"></a>                    <span class="va">store</span>.<span class="at">add</span>(value<span class="op">,</span> file_path)<span class="op">;</span></span>
<span id="cb2-46"><a href="#cb2-46"></a>                <span class="op">}</span>   </span>
<span id="cb2-47"><a href="#cb2-47"></a>            <span class="op">};</span> </span>
<span id="cb2-48"><a href="#cb2-48"></a></span>
<span id="cb2-49"><a href="#cb2-49"></a>            <span class="co">// Close the db when the transaction is done</span></span>
<span id="cb2-50"><a href="#cb2-50"></a>            <span class="va">tx</span>.<span class="at">oncomplete</span> <span class="op">=</span> <span class="kw">function</span>() <span class="op">{</span></span>
<span id="cb2-51"><a href="#cb2-51"></a>                <span class="va">db</span>.<span class="at">close</span>()<span class="op">;</span></span>
<span id="cb2-52"><a href="#cb2-52"></a>            <span class="op">};</span></span>
<span id="cb2-53"><a href="#cb2-53"></a>        <span class="op">}</span></span>
<span id="cb2-54"><a href="#cb2-54"></a>    <span class="op">}</span></span>
<span id="cb2-55"><a href="#cb2-55"></a></span>
<span id="cb2-56"><a href="#cb2-56"></a>    <span class="va">document</span>.<span class="at">getElementsByTagName</span>(<span class="st">&#39;head&#39;</span>)[<span class="dv">0</span>].<span class="at">appendChild</span>(sc)<span class="op">;</span></span>
<span id="cb2-57"><a href="#cb2-57"></a><span class="op">}</span></span></code></pre></div>
<p>Then, each time you need to download a file, you can just use the following javascript:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode javascript"><code class="sourceCode javascript"><span id="cb3-1"><a href="#cb3-1"></a><span class="op">%%</span>javascript</span>
<span id="cb3-2"><a href="#cb3-2"></a><span class="kw">var</span> url <span class="op">=</span> <span class="st">&#39;https://opendata-ajuntament.barcelona.cat/data/es/api/3/action/datastore_search?resource_id=69ae574f-adfc-4660-8f81-73103de169ff&#39;</span></span>
<span id="cb3-3"><a href="#cb3-3"></a><span class="va">window</span>.<span class="at">saveJSONP</span>(url<span class="op">,</span> <span class="st">&#39;data/menors.json&#39;</span>)</span></code></pre></div>
<p>To clarify: use either a Python kernel with the <code>%%javascript</code> magic or the JavaScript kernel for <em>both</em> the definition and the call; otherwise the two cells won’t see each other.</p>
<p>Then from a python cell we can read it the standard way:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb4-1"><a href="#cb4-1"></a><span class="im">import</span> json</span>
<span id="cb4-2"><a href="#cb4-2"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
<span id="cb4-3"><a href="#cb4-3"></a></span>
<span id="cb4-4"><a href="#cb4-4"></a><span class="cf">with</span> <span class="bu">open</span>(<span class="st">&#39;data/menors.json&#39;</span>, <span class="st">&#39;r&#39;</span>) <span class="im">as</span> f:</span>
<span id="cb4-5"><a href="#cb4-5"></a>  data <span class="op">=</span> json.load(f)</span>
<span id="cb4-6"><a href="#cb4-6"></a>  </span>
<span id="cb4-7"><a href="#cb4-7"></a>pd.read_json(json.dumps(data[<span class="st">&#39;result&#39;</span>][<span class="st">&#39;records&#39;</span>]))</span></code></pre></div>
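<p>Since the CKAN <code>datastore_search</code> response nests the rows under <code>result.records</code>, an equivalent and slightly more direct route is to build the DataFrame from that list. A minimal sketch, with made-up records standing in for the downloaded file:</p>

```python
import pandas as pd

# Minimal stand-in for a CKAN datastore_search response (made-up records)
data = {'result': {'records': [
    {'name': 'Abbey', 'number': 3},
    {'name': 'Level', 'number': 666},
]}}

# The rows live under result.records, so pandas can consume the list directly
df = pd.DataFrame(data['result']['records'])
```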
<p>You can find a notebook with the whole code for your convenience <a href="https://gist.github.com/6418a53b50568a2b201bf592d854c0df#file-pythonjsonphelper-ipynb">in this GIST</a>.</p>
<h2 id="conclusions">Conclusions</h2>
<ul>
<li><p>We are just starting to see the potential of WebAssembly-based solutions and of the browser as a platform (IndexedDB…). This will increase the demand for data accessibility across origins.</p></li>
<li><p>If you are a data provider, please consider enabling CORS to promote the usage of your data. Otherwise you will be shutting a growing market of web-based analysis tools out of your data.</p></li>
</ul>
<h2 id="references">References</h2>
<ul>
<li>Simple IndexedDB <a href="https://gist.github.com/JamesMessinger/a0d6389a5d0e3a24814b">example</a></li>
<li><a href="https://github.com/jupyterlite/jupyterlite/discussions/91?sort=new">Sample code</a> for reading and writing files in JupyterLite (this is where the idea for this post comes from).</li>
<li><a href="https://enable-cors.org/">On CORS</a> and how to enable it.</li>
<li><a href="https://www.w3.org/wiki/CORS_Enabled">A W3C article</a> on how to open your data by enabling CORS and why it is important, with a list of providers implementing it.</li>
<li>A test <a href="https://www.test-cors.org/">web page</a> to check if a server is CORS enabled.</li>
</ul>
<section class="footnotes" role="doc-endnotes">
<hr />
<ol>
<li id="fn1" role="doc-endnote"><p>If you are curious about the possible solutions to this problem, you may like to read how <a href="https://webvm.io/">WebVM</a>, a server-less virtual Debian, implements a general solution <a href="https://leaningtech.com/webvm-virtual-machine-with-networking-via-tailscale/">here</a>.<a href="#fnref1" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
</ol>
</section>

<div class="panel panel-default">
    <div class="panel-body">
        <div class="pull-left">
            Tags: <a href="/tags/jupyterlite.html">jupyterlite</a>, <a href="/tags/CORS.html">CORS</a>, <a href="/tags/data.html">data</a>, <a href="/tags/webassembly.html">webassembly</a>
        </div>
        <div class="social pull-right">
            <span class="twitter">
                <a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2023/01/29/jupyterlite-jsonp.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
            </span>

            
        </div>
    </div>
</div>

<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>

<div id="disqus_thread"></div>  
<script type="text/javascript">
      var disqus_shortname = 'jarnaldich';
      (function() {
          var dsq = document.createElement('script');
          dsq.type = 'text/javascript';
          dsq.async = true;
          dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
          (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
      })();
</script>
<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
    
]]></description>
    <pubDate>Sun, 29 Jan 2023 00:00:00 UT</pubDate>
    <guid>http://jarnaldich.me/blog/2023/01/29/jupyterlite-jsonp.html</guid>
    <dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
    <title>Data Manipulation with JupyterLite</title>
    <link>http://jarnaldich.me/blog/2022/12/08/data-manipulation-jupyterlite.html</link>
    <description><![CDATA[<h1>Data Manipulation with JupyterLite</h1>

<small>Posted on December  8, 2022 <a href="/blog/2022/12/08/data-manipulation-jupyterlite.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>

<figure>
<img src="/images/jupyterlite.png" title="JupyterLite screenshot" class="center" width="400" alt="" /><figcaption> </figcaption>
</figure>
<p>Data comes in all sizes, shapes and qualities. The process of getting data ready for further analysis is equally crucial and tedious, as many data professionals will <a href="https://forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/">confirm</a>.</p>
<p>This process of many names (data wrangling/munging/cleaning) is often performed by an unholy mix of command-line tools, one-shot scripts and whatever is at hand depending on the data formats and computing environment.</p>
<p>I have been intending to share some of the tools I have found useful for this task in a series of blog posts, especially if they are particularly unexpected or lesser-known. I will always try to demonstrate the tool with some common data processing application, and then finally highlight which conditions the tool is most suitable under.</p>
<h1 id="jupyterlite">JupyterLite</h1>
<p>Jupyter/JupyterLab are the de-facto standard notebook environment, especially among Python data scientists (although it was designed from the start to work with multiple languages or <em>kernels</em>, as the <a href="https://blog.jupyter.org/i-python-you-r-we-julia-baf064ca1fb6">name hints</a>). The frontend runs in a browser, and setting up the backend often requires a local installation, although some providers will let you spin up a backend in the cloud; see <a href="https://colab.research.google.com/">Google Colab</a> or <a href="https://mybinder.org/">The Binder Project</a>.</p>
<p>JupyterLite is a simpler, cleaner solution for simple analyses when sharing is not needed: it is a fairly complete Jupyter environment in which every component runs in the browser via WebAssembly compilation. Just visit its <a href="https://github.com/jupyterlite/jupyterlite">Github</a> project page for the details. Following some of the referenced projects and examples is a worthy rabbit hole to enter.</p>
<p>Some things you might not expect from a webassembly solution:</p>
<ul>
<li>Comes with most data-science libraries ready to use: matplotlib, pandas, numpy.</li>
<li>Can install third party packages via regular magic:</li>
</ul>
<pre><code>%pip install -q bqplot ipyleaflet</code></pre>
<h2 id="example-not-so-simple-excel-manipulation">Example: Not so Simple Excel Manipulation</h2>
<p>Sometimes you need to perform a not-so-simple manipulation on an Excel sheet, one that outgrows pivot tables but is the bread and butter of pandas. Since copying an Excel table defaults to a tab-separated string, getting a pandas DataFrame is as easy as firing up JupyterLite by visiting <a href="https://jupyterlite.github.io/demo/lab/index.html">this page</a>, opening a Python notebook, and evaluating the following code in the first cell, pasting the Excel table between the multi-line string delimiters:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode python"><code class="sourceCode python"><span id="cb2-1"><a href="#cb2-1"></a><span class="im">import</span> pandas <span class="im">as</span> pd</span>
<span id="cb2-2"><a href="#cb2-2"></a><span class="im">import</span> io</span>
<span id="cb2-3"><a href="#cb2-3"></a></span>
<span id="cb2-4"><a href="#cb2-4"></a>df <span class="op">=</span> pd.read_table(io.StringIO(<span class="st">&quot;&quot;&quot;</span></span>
<span id="cb2-5"><a href="#cb2-5"></a><span class="st">&lt;PRESS C-V HERE&gt;</span></span>
<span id="cb2-6"><a href="#cb2-6"></a><span class="st">&quot;&quot;&quot;</span>))</span>
<span id="cb2-7"><a href="#cb2-7"></a>df</span></code></pre></div>
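<p>The round trip works too: once transformed, a DataFrame can be turned back into a tab-separated string that pastes cleanly into a sheet. A small sketch with made-up data:</p>

```python
import pandas as pd

# Hypothetical result of some wrangling
df = pd.DataFrame({'city': ['Barcelona', 'Girona'], 'count': [10, 3]})

# Tab-separated output pastes straight back into Excel
tsv = df.to_csv(sep='\t', index=False)
print(tsv)
```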
<h1 id="highlights">Highlights</h1>
<ul>
<li><strong>Useful for:</strong> The kind of analysis/manipulation one would use Pandas / Numpy for, especially if it involves visualizations or richer interaction.</li>
<li><strong>Useful when:</strong> You don’t have access to a pre-installed Jupyter environment but have a modern browser and intenet connection at hand, or when you are dealing with sensitive data that should not leave your computer.</li>
</ul>
<h1 id="conclusion">Conclusion</h1>
<p>JupyterLite is an amazing project: as with many webassembly based solutions, we are just starting to see the possibilities. I encourage you to explore it beyond data manipulation because you can easily find other applications for it, from interactive dashboards to authoring diagrams…</p>

<div class="panel panel-default">
    <div class="panel-body">
        <div class="pull-left">
            Tags: <a href="/tags/data.html">data</a>, <a href="/tags/tools.html">tools</a>, <a href="/tags/jupyterlite.html">jupyterlite</a>, <a href="/tags/data-manipulation.html">data-manipulation</a>, <a href="/tags/data-wrangling.html">data-wrangling</a>, <a href="/tags/data-munging.html">data-munging</a>, <a href="/tags/webassembly.html">webassembly</a>
        </div>
        <div class="social pull-right">
            <span class="twitter">
                <a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2022/12/08/data-manipulation-jupyterlite.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
            </span>

            
        </div>
    </div>
</div>

<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>

<div id="disqus_thread"></div>  
<script type="text/javascript">
      var disqus_shortname = 'jarnaldich';
      (function() {
          var dsq = document.createElement('script');
          dsq.type = 'text/javascript';
          dsq.async = true;
          dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
          (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
      })();
</script>
<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
    
]]></description>
    <pubDate>Thu, 08 Dec 2022 00:00:00 UT</pubDate>
    <guid>http://jarnaldich.me/blog/2022/12/08/data-manipulation-jupyterlite.html</guid>
    <dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
    <title>Cloud Optimized Vector</title>
    <link>http://jarnaldich.me/blog/2022/04/22/cloud-optimized-vector.html</link>
    <description><![CDATA[<h1>Cloud Optimized Vector</h1>

<small>Posted on April 22, 2022 <a href="/blog/2022/04/22/cloud-optimized-vector.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>

<p>A few days ago a coworker of mine sent me a <a href="http://blog.cleverelephant.ca/2022/04/coshp.html">recent article</a> by Paul Ramsey (of <a href="http://blog.cleverelephant.ca/projects">Postgis et al.</a> fame) reflecting on what would a Cloud Optimized Vector format look like. His shocking proposal was … (didn’t see that coming)… shapefiles!</p>
<img src="https://imgs.xkcd.com/comics/duty_calls.png" title="fig:someone is wrong on the internet" class="center" alt=" " />
<p>
<center>
<small>Source: xkcd</small>
</center>
</p>
<p>I understand the article was written as a provocation for thought and as such makes some really good points. I also think that the general discussion over what a “cloud optimized vector” format would look like can be productive. But I am afraid that some less experienced developers (or, God forbid, managers!) might take the proposal of pushing shapefiles as the next cloud format a bit too literally, so I thought I would give some context and counterpoint to that article.</p>
<p>Him being Paul Ramsey and me being… well… <a href="/about.html">me</a>, I’d better motivate my opinion, so here comes a longish post. I will try to analyze what makes something <em>cloud optimized</em> based on the COG experience, see how that could be applied to a vector format, then justify why shapefiles should be (once again) avoided and finally see if we can get any closer to an ideal cloud vector format.</p>
<h2 id="what-makes-something-cloud-optimized-anyway">What makes something <em>cloud optimized</em> anyway?</h2>
<p><a href="https://www.cogeo.org/">Cloud Optimized GeoTiffs</a> are technically just a name for a GeoTiff with a <a href="https://github.com/cogeotiff/cog-spec/blob/master/spec.md">particular internal organization</a> (the sequencing of the bytes on disk). Tiff is an old format (old as in <em>venerable</em>) that allows for huge flexibility in terms of internal storage, data types, etc… For example, an image can be stored on disk one line after the other or, as is the case with COG, in small square “mini images” called tiles. Those tiles are then arranged in a larger grid and then several coarser-resolution layers (called overviews) of such grids can be stacked together to form an <a href="https://en.wikipedia.org/wiki/Pyramid_(image_processing)">image pyramid</a>.</p>
<img src="/images/pyramid.jpeg" title="fig:pyramid mage" class="center" width="400" alt=" " />
<p>
<center>
<small>Source: OsGEO Wiki</small>
</center>
</p>
<p>Of course, all data is properly indexed within the file so that accessing a tile of any pyramid level is easy (seeking byte ranges and at most some trivial multiplications or additions).</p>
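<p>As a back-of-the-envelope sketch (assuming fixed-size uncompressed tiles; real TIFFs generalize this with per-tile offset tables in the IFD), locating a tile really does boil down to arithmetic like this:</p>

```python
def tile_byte_range(col, row, tiles_per_row, tile_bytes, data_start):
    """Byte range of a tile in a row-major grid of fixed-size tiles."""
    index = row * tiles_per_row + col
    start = data_start + index * tile_bytes
    return start, start + tile_bytes

# 256x256 RGB tiles (3 bytes per pixel) laid out after a 1 KiB header
start, end = tile_byte_range(col=2, row=1, tiles_per_row=4,
                             tile_bytes=256 * 256 * 3, data_start=1024)
```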
<p>Whenever data is fetched in chunks through a channel with some latency (be it disk transfer or network), the efficiency of the overall processing can be improved by organizing data in the same order it will be read by the algorithm to compensate for the cost of setting up each read operation (seek times of spinning disks or protocol overhead in network communications).</p>
<p>A corollary is that <em>data formats are not efficient per se</em>, in a vacuum: efficiency always depends on the process/algorithm/use case. For example, for a raster point operation (such as applying a threshold mask for some value), organizing data line by line with no overviews is more efficient than a COG would be (…and that is why the Geotiff spec allows for different configurations).</p>
<p>When dealing with spatial data, that principle meets a loose version of <a href="https://en.wikipedia.org/wiki/Tobler%27s_first_law_of_geography">Tobler’s First Law</a>: data representing a nearby area is more likely to be accessed next. For example, when a user is viewing an image, tiles that are close to the ones on screen are more likely to be fetched next than tiles representing remote areas (because users pan rather than jump randomly).</p>
<p>So what is the use case COG is having in mind? Well, in case you hadn’t figured it out already, it is mainly <em>visualization</em><a href="#fn1" class="footnote-ref" id="fnref1" role="doc-noteref"><sup>1</sup></a>. Overviews allow for zooming in and out efficiently and tiles help with moving along a subset of the higher resolution.</p>
<p>This pattern has been the ABC of raster optimization for decades in the geospatial world. Be it <a href="https://mapproxy.org/docs/1.13.0/caches.html">tile caches</a>, <a href="https://www.ogc.org/standards/tms">tiling schemes</a>, <a href="https://mapserver.org/optimization/raster.html">WMS map servers</a>, etc… they all<a href="#fn2" class="footnote-ref" id="fnref2" role="doc-noteref"><sup>2</sup></a> try to have the same properties:</p>
<ol type="1">
<li>Efficient navigation along contiguous resolutions (through overviews, pyramids, wavelets).</li>
<li>Efficient access of contiguous areas at a given resolution (tiling).</li>
</ol>
<p>This also turns out to be a pretty sensible organization if you cannot know in advance what kind of processing will be performed, because it gives you fast access to a manageable piece of the data: be it a summary (overview), a subset (a slice of tiles), or a combination of both.</p>
<p>Notice what it does <em>not</em> allow, though: it leaves you in the dry if you need a subset based on the <em>content</em> of the data: eg. I would like to see all pixels with a red channel value of 42: in that case you would have to read the whole image.</p>
<p>COG is just a name for a GeoTiff implementing that organization. It goes a bit further than that by forcing a particular order<a href="#fn3" class="footnote-ref" id="fnref3" role="doc-noteref"><sup>3</sup></a> of the inner sections, which is smart because a client can ask for a chunk at the beginning and it will get all the directories (think indices, metadata) and probably some overviews. That makes sense, because most viewers will start with the lowest zoom that covers the bounding box. It is also a nice organization for <em>streaming</em> tiles of data.</p>
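<p>That front-loaded layout is what lets a client bootstrap from a single HTTP Range request. A sketch of the idea, with a hypothetical URL (the request is built but not sent; readers such as GDAL’s <code>/vsicurl/</code> do this under the hood):</p>

```python
import urllib.request

# Hypothetical COG URL: a reader starts by fetching the first chunk,
# which holds the IFDs (indices, metadata) and, in a COG, usually the
# smallest overviews as well
url = 'https://example.com/ortho_cog.tif'
req = urllib.request.Request(url, headers={'Range': 'bytes=0-16383'})
```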
<p>With that in mind, what would it mean for a vector format to be “cloud ready”? Well it sure should allow for the visualization use case, and here it would mean loosely speaking “rendering a map”, so that gives us an idea:</p>
<ol type="1">
<li>Having the ability to navigate different <em>zoom levels</em> / scales / generalization(s).</li>
<li>Efficient rendering of nearby areas at a given resolution.</li>
</ol>
<p>Notice that point 1 <em>as a process</em> is much harder in vector than in raster formats: for rasters it is (mostly) a question of choosing what “summary” measure we pick for the overview pixel corresponding to the underlying level (nearest neighbor, interpolation, average, other…). Generalizing a vector is much harder, first because it can break topology and geometry validity in many ways, but also because deciding if/how to represent different features at different scales requires cartographic design knowledge. But that is not relevant <em>for the format itself</em>: it just needs to be flexible enough to allow for different geometries at different resolutions and efficient in navigating between them (we do not care how hard it was to generate the different resolution levels).</p>
<p>While I think these two requirements are the equivalent of what a COG offers for raster, I am unsure we would consider them enough in the vector case. For example, we might not find it acceptable to be unable to take subsets or summaries based on attribute values, so there is a whole new level of complexity for vector <em>at the format level</em> as well. It all boils down to whether by <em>vector</em> we mean <em>features</em> or just <em>geometries</em>.</p>
<p>Now that I’ve established the two conditions I think define <em>cloud optimization</em>, at least by COG standards, let’s first dive into why I would say Shapefiles are <em>not</em> the future of the cloud.</p>
<h2 id="the-noble-art-of-bashing-shapefiles">The noble art of bashing shapefiles</h2>
<p>A lot has been argued over the years on the <a href="http://switchfromshapefile.org/">problems with shapefiles</a>. I will just refer here the problems specifically relevant in a cloud setting.</p>
<p>First, they are a multiple-file format. There is a cost in the OS layer for opening a file (name resolution, checking permissions), and the web server will probably add another layer on top of that, so please let’s not choose a format for the cloud that means opening a .shp, .shx, .dbf, .prj, .qix… and <a href="https://desktop.arcgis.com/en/arcmap/10.3/manage-data/shapefiles/shapefile-file-extensions.htm">potentially all of these</a>.</p>
<p>Second, they are limited to 2 GB of file size. Most COGs are effectively BigTiffs and easily <em>need</em> to go far beyond that. In any case, one of the reasons for moving to the cloud is being able to process larger data.</p>
<p>Third, they are not even good for representation: you need several of them, one for each layer/geometry type, to make most general maps (except maybe choropleths and other thematic maps). That multiplies the number of files even more.</p>
<p>Finally, Paul’s article addresses only property number 2: accessing contiguous areas at a given resolution. That alone is not cloud ready in the same way COGs are: we also need multi-scale map representation (property 1). You can of course use some sort of attribute to filter which elements should appear at different resolution levels, but that means attribute indexing and clashes with spatial ordering. The other option would be using different shapefiles for different layers, so even more files.</p>
<p>The tool for spatial ordering the article suggests would certainly be useful for a streaming algorithm where spatial contiguity is relevant, but then again there are <a href="https://flatgeobuf.org/">options tailored for this use case</a>.</p>
<h1 id="is-there-a-better-option">Is there a better option?</h1>
<p>For the representation use case, which is what COGs provide, there certainly is one, and it has been around for a long time. It’s just that we call them <a href="https://docs.mapbox.com/data/tilesets/guides/vector-tiles-introduction/">vector tiles</a>.</p>
<p>Vector tiles are exactly the application of the old tiling-schema idea to vectors. It’s just that instead of mini-images, we have a <code>pbf</code>-encoded <a href="https://github.com/mapbox/vector-tile-spec/tree/master/2.1#41-layers">format</a> for geometries and attributes.</p>
<p>Those tiles are then organized into the same arrangement of grids and pyramids for different resolutions that we had in a COG. It’s just that most of the time the tiling does not depend on the dataset (though it can), but is <a href="https://www.maptiler.com/google-maps-coordinates-tile-bounds-projection/#3/15.00/50.00">globally fixed</a>, with a set of well-known tile schemas.</p>
<p>The tiles can have different schemas and information at different resolution levels (zoom) to allow for different generalization and visualization options.</p>
<p>We can pack all those tiles into a single <code>.mbtiles</code> file, which is a <code>sqlite</code>-based format containing the tiles as blobs. Having a global tile scheme is nice because you can then use sqlite’s <code>.attach</code> command to merge datasets, for example. And you can include any metadata (projection, etc…) inside a single file.</p>
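<p>A toy sketch of what this looks like at the SQL level, using the standard MBTiles <code>tiles</code> schema but an in-memory database and a fake blob instead of a real file:</p>

```python
import sqlite3

# In-memory stand-in for an .mbtiles file; the real spec stores
# pbf-encoded vector tiles in the tile_data blob column
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE tiles (zoom_level INTEGER, tile_column INTEGER, '
           'tile_row INTEGER, tile_data BLOB)')
db.execute('INSERT INTO tiles VALUES (?, ?, ?, ?)', (0, 0, 0, b'fake-pbf'))

# Fetching the tile for a given zoom/column/row is a plain indexed lookup
row = db.execute('SELECT tile_data FROM tiles WHERE zoom_level=? AND '
                 'tile_column=? AND tile_row=?', (0, 0, 0)).fetchone()
```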
<p>And of course there are libraries for rendering them in the browser (that is their primary use case), among <a href="https://github.com/mapbox/awesome-vector-tiles">many other things</a>. But Paul already knows that, since <a href="https://postgis.net/docs/ST_AsMVT.html">PostGis</a> itself can generate them.</p>
<h1 id="are-we-there-yet">Are we there yet?</h1>
<p>Well, for representation, at least we are close… but what if we want more complex queries over that (think spatial SQL)? With an <code>.mbtiles</code> alone you would need to actually decode each <code>.pbf</code> and query the attributes, so no luck there…</p>
<p>In a sqlite-based format (like <code>.mbtiles</code> or GeoPackage), it should be possible to add extra tables for queries that may or may not reference the main tiles… but that is an idea yet to be developed.</p>
<p>The other caveat for <em>vector tiles</em> is the possible loss of information as a general geometry repository. Internal VT coordinates are integers (mainly because they are optimal for screen rendering algorithms), so there is a discrete resolution at each zoom level. Special care has to be taken so that there is no loss of information (i.e. making sure the zoom levels go deep enough for the internal grid cell to be below the resolution of the measuring instruments). So again, they may not be suitable for every application.</p>
<h1 id="conclusion">Conclusion</h1>
<p>I hope I made my point on why I do not think shapefiles are the future of cloud-based vector formats (I wrote this in a bit of a hurry) and, more importantly, that the “cloud optimization” concept of the raster world can only be applied to vector formats in a limited way. I <em>do</em> think there is an interesting space to explore, though… Of course I may be completely wrong and maybe Paul has actually found something.</p>
<p>Time will tell, I guess…</p>
<section class="footnotes" role="doc-endnotes">
<hr />
<ol>
<li id="fn1" role="doc-endnote"><p>The trick is that some cloud processing platforms such as the <a href="https://earthengine.google.com/">Google Earth Engine</a> are in fact processing on a <em>visualization driven</em> also called <em>lazy</em> processing scheme: only the data that is visualized at any moment by the user gets processed, on demand, so the same principle applies.<a href="#fnref1" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn2" role="doc-endnote"><p>Actually, not all: there are more sophisticated methods, like the wavelet transforms allowing for multi-resolution decoding in formats like .ECW/MrSID (commercial) or JPEG 2000, but for the purpose of this post let’s just call them very sophisticated pyramids.<a href="#fnref2" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
<li id="fn3" role="doc-endnote"><p>For many applications, the hard requirements are tiles and overviews. The order of IFDs may not have much of an impact. I encourage the user to try and read a “regular” tiled tiff through <code>/vsicurl/</code> in QGIS. Or even a raster geopackage, for that matter.<a href="#fnref3" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
</ol>
</section>

<div class="panel panel-default">
    <div class="panel-body">
        <div class="pull-left">
            Tags: <a href="/tags/vector.html">vector</a>, <a href="/tags/vector-tiles.html">vector-tiles</a>, <a href="/tags/mbtiles.html">mbtiles</a>, <a href="/tags/sqlite.html">sqlite</a>
        </div>
        <div class="social pull-right">
            <span class="twitter">
                <a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2022/04/22/cloud-optimized-vector.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
            </span>

            
        </div>
    </div>
</div>

<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>

<div id="disqus_thread"></div>  
<script type="text/javascript">
      var disqus_shortname = 'jarnaldich';
      (function() {
          var dsq = document.createElement('script');
          dsq.type = 'text/javascript';
          dsq.async = true;
          dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
          (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
      })();
</script>
<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
    
]]></description>
    <pubDate>Fri, 22 Apr 2022 01:00:00 UT</pubDate>
    <guid>http://jarnaldich.me/blog/2022/04/22/cloud-optimized-vector.html</guid>
    <dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
    <title>ETL The Haskell Way</title>
    <link>http://jarnaldich.me/blog/2022/03/27/etl-the-haskell-way.html</link>
    <description><![CDATA[<h1>ETL The Haskell Way</h1>

<small>Posted on March 27, 2022 <a href="/blog/2022/03/27/etl-the-haskell-way.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>

<p>Extract Transform Load (ETL) is a broad term for processes that read a subset of data in one format, perform a more or less involved transformation, and then store it in a (maybe) different format. Those processes can of course be linked together to form larger data pipelines. As with many such general terms, it can mean very different things in terms of software architecture and implementation. For example, depending on the scale of the data, the solution may range from a Unix shell pipeline to a full-blown <a href="https://nifi.apache.org/">Apache NiFi</a> deployment.</p>
<p>One common theme is data impedance mismatch between formats. Take for example JSON and XML. They are surely different, but for any particular application you can find a way to move data from one to the other. They even have their own <a href="https://chrispenner.ca/posts/traversal-systems">traversal systems</a> (<a href="https://stedolan.github.io/jq/">jq</a>’s syntax and <a href="https://developer.mozilla.org/en-US/docs/Web/XPath">XPath</a>).</p>
<p>The most widely used solution for small to medium data is to write small ad-hoc scripts. One can somewhat abstract over these formats by <a href="https://blog.lazy-evaluation.net/posts/linux/jq-xq-yq.html">abusing jq</a>.</p>
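<p>For illustration, the typical ad-hoc script might look like this in Python (a hypothetical sketch: it parses a World Bank-style XML of <code>&lt;record&gt;</code>/<code>&lt;field&gt;</code> elements into a dict; the sample data and helper name are invented for the example):</p>

```python
import xml.etree.ElementTree as ET

SAMPLE = """<Root><data>
  <record>
    <field name="Country or Area" key="ABW">Aruba</field>
    <field name="Year">1960</field>
    <field name="Value">54208</field>
  </record>
</data></Root>"""

def population_by_country(xml_text, year="1960"):
    # Walk every <record>, pair up the sibling <field> elements by their
    # name attribute, and keep only the rows for the requested year.
    result = {}
    for record in ET.fromstring(xml_text).iter("record"):
        fields = {f.get("name"): f for f in record.iter("field")}
        if fields["Year"].text == year:
            result[fields["Country or Area"].get("key")] = int(fields["Value"].text)
    return result

print(population_by_country(SAMPLE))  # {'ABW': 54208}
```

It gets the job done, but every format pairing needs its own one-off glue code of this kind.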
<p>In this blog post we will explore a more elegant way to perform such transformations using Haskell. The purpose of this post is just to pique your curiosity about what’s possible in this area with Haskell. It is definitely <em>not</em> intended as a tutorial on optics, which are not for Haskell beginners anyway…</p>
<h2 id="the-problem">The Problem</h2>
<p>We will enrich a <a href="https://datatracker.ietf.org/doc/html/rfc7946">geojson</a> dataset containing <a href="static/countries.geo.json">countries</a> at world scale, taken from Natural Earth, with <a href="static/population.xml">population data in XML</a> as provided by the World Bank API, so that it can be used, for example, to produce a choropleth <del>map</del><a href="#fn1" class="footnote-ref" id="fnref1" role="doc-noteref"><sup>1</sup></a> visualization.</p>
<figure>
<img src="/images/worldpop.png" title="this is not a map" class="center" alt="" /><figcaption> </figcaption>
</figure>
<p>Haskell is a curiously effective fit for this kind of problem due to the unlikely combination of three seemingly unrelated traits: its parsing libraries (driven by a community interested in programming language theory), <em>optics</em> (also driven by PLT, and by the gruesome syntax of record accessors, at least up to the recent addition of <code>RecordDotSyntax</code>), and the convenience of writing scripts with the <code>stack</code> tool (driven by the olden unreliability of <code>cabal</code> builds).</p>
<p>It is the fact that Haskell is so <em>abstract</em> that makes it easy to combine libraries never intended to work together in the first place. Haskell libraries tend to define their interfaces in very general terms (eg. structures that can be mapped over, structures that can be “summarized”, etc.).</p>
<p>Let’s break down how these work together.</p>
<h3 id="parsing-libraries">Parsing Libraries</h3>
<p>Haskell comes from a long tradition of programming language theory applications, and it shines at building parsers, so there is no shortage of libraries for reading the most common formats. But more important than the availability of parsing libraries itself is the <a href="https://lexi-lambda.github.io/blog/2019/11/05/parse-don-t-validate/">parse, don’t validate</a> approach these libraries take: most of them are able to decode (deserialize, parse) their input into a well-typed structured value in memory (think Abstract Syntax Tree).</p>
<p>So a typical workflow would be to read the data from disk into a more or less abstract representation in memory involving nested data structures, then transform it into another representation in memory (maybe generated from a template) through the use of optics and then serialize it back to disk:</p>
<figure>
<img src="/images/haskell_lens_workflow.png" title="Haskell lens workflow" class="center" alt="" /><figcaption> </figcaption>
</figure>
<h3 id="optics">Optics</h3>
<p>Optics (lenses, prisms, traversals) are a way to abstract getters and setters in a composable fashion. Their surface syntax reads like “pinpointing” or “bookmarking” into a deeply nested data structure (think <code>XPath</code>), which makes it easy to visually keep track of what is being read or altered.</p>
<p>The learning curve is steep, and the error messages convoluted, but the fact that in Haskell we can abstract accessors away from any particular data structure, and that there are well-defined functions to combine them, can reduce the size of your data transformation toolbox. And lighter toolboxes are easier to carry around with you.</p>
<h3 id="scripting">Scripting</h3>
<p>A lot of data wrangling programs are one-shot scripts, where you care about the result more than about the software itself. Having to create a new app each time can be tiresome, so being able to script while relying on a set of curated libraries to get the job done is really nice. Starting with a script that can be turned at any time into a full-blown app that works on all the major platforms is a plus.</p>
<h2 id="the-solution">The Solution</h2>
<p>The steps follow the typical workflow quite closely. In our case:</p>
<ol type="1">
<li>Parse the <code>.xml</code> file into a data structure (a document) in memory.</li>
<li>Build a map from country codes to population.</li>
<li>Read the geojson file with country info and get the array of features.</li>
<li>For each feature, create a new key with the population.</li>
</ol>
<p>This overall structure can be traced in our main function:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span id="cb1-1"><a href="#cb1-1"></a>main <span class="ot">=</span> <span class="kw">do</span></span>
<span id="cb1-2"><a href="#cb1-2"></a>  xml <span class="ot">&lt;-</span> XML.readFile XML.def <span class="st">&quot;population.xml&quot;</span> <span class="co">-- Parse the XML file into a memory document</span></span>
<span id="cb1-3"><a href="#cb1-3"></a>  <span class="kw">let</span> pop2020Map <span class="ot">=</span> Map.fromList <span class="op">$</span> runReader records xml <span class="co">-- Build a map Country -&gt; Population</span></span>
<span id="cb1-4"><a href="#cb1-4"></a>  jsonBytes <span class="ot">&lt;-</span> LB8.readFile <span class="st">&quot;countries.geo.json&quot;</span> <span class="co">-- Parse the countries geojson into memory</span></span>
<span id="cb1-5"><a href="#cb1-5"></a>  <span class="kw">let</span> <span class="dt">Just</span> json <span class="ot">=</span> Json.decode<span class="ot"> jsonBytes ::</span> <span class="dt">Maybe</span> <span class="dt">Json.Value</span></span>
<span id="cb1-6"><a href="#cb1-6"></a>  <span class="kw">let</span> featureList <span class="ot">=</span> runReader (features pop2020Map)<span class="ot"> json ::</span> [ <span class="dt">Json.Value</span> ] <span class="co">-- Get features with new population key</span></span>
<span id="cb1-7"><a href="#cb1-7"></a>  <span class="kw">let</span> newJson <span class="ot">=</span> json <span class="op">&amp;</span> key <span class="st">&quot;features&quot;</span>  <span class="op">.~</span> (<span class="dt">Json.Array</span> <span class="op">$</span> V.fromList featureList) <span class="co">-- Update the original Json</span></span>
<span id="cb1-8"><a href="#cb1-8"></a>  LB8.writeFile <span class="st">&quot;countriesWithPopulation.geo.json&quot;</span> <span class="op">$</span> Json.encode newJson <span class="co">-- Write back to disk</span></span></code></pre></div>
<p>The shape of the input data is not especially well suited to this task. The world population XML is basically a table in disguise (remember the data impedance problem?): a long list of records like this one:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode xml"><code class="sourceCode xml"><span id="cb2-1"><a href="#cb2-1"></a>    <span class="kw">&lt;record&gt;</span></span>
<span id="cb2-2"><a href="#cb2-2"></a>      <span class="kw">&lt;field</span><span class="ot"> name=</span><span class="st">&quot;Country or Area&quot;</span><span class="ot"> key=</span><span class="st">&quot;ABW&quot;</span><span class="kw">&gt;</span>Aruba<span class="kw">&lt;/field&gt;</span></span>
<span id="cb2-3"><a href="#cb2-3"></a>      <span class="kw">&lt;field</span><span class="ot"> name=</span><span class="st">&quot;Item&quot;</span><span class="ot"> key=</span><span class="st">&quot;SP.POP.TOTL&quot;</span><span class="kw">&gt;</span>Population, total<span class="kw">&lt;/field&gt;</span></span>
<span id="cb2-4"><a href="#cb2-4"></a>      <span class="kw">&lt;field</span><span class="ot"> name=</span><span class="st">&quot;Year&quot;</span><span class="kw">&gt;</span>1960<span class="kw">&lt;/field&gt;</span></span>
<span id="cb2-5"><a href="#cb2-5"></a>      <span class="kw">&lt;field</span><span class="ot"> name=</span><span class="st">&quot;Value&quot;</span><span class="kw">&gt;</span>54208<span class="kw">&lt;/field&gt;</span></span>
<span id="cb2-6"><a href="#cb2-6"></a>    <span class="kw">&lt;/record&gt;</span></span></code></pre></div>
<p>That means the function that reads it has to associate information from two siblings in the XML tree, but that is easy using the <code>magnify</code> function inside a <code>Reader</code> monad:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span id="cb3-1"><a href="#cb3-1"></a><span class="ot">records ::</span> <span class="dt">Reader</span> <span class="dt">XML.Document</span> [(<span class="dt">T.Text</span>, <span class="dt">Scientific</span>)]</span>
<span id="cb3-2"><a href="#cb3-2"></a>records <span class="ot">=</span></span>
<span id="cb3-3"><a href="#cb3-3"></a>  <span class="kw">let</span></span>
<span id="cb3-4"><a href="#cb3-4"></a>    <span class="co">-- Lens to access an attribute from record to field. Intended to be composed.</span></span>
<span id="cb3-5"><a href="#cb3-5"></a>    field name <span class="ot">=</span> nodes <span class="op">.</span> folded <span class="op">.</span> _Element <span class="op">.</span> named <span class="st">&quot;field&quot;</span> <span class="op">.</span> attributeIs <span class="st">&quot;name&quot;</span> name</span>
<span id="cb3-6"><a href="#cb3-6"></a>  <span class="kw">in</span> <span class="kw">do</span></span>
<span id="cb3-7"><a href="#cb3-7"></a>    <span class="co">-- Zoom and iterate all records</span></span>
<span id="cb3-8"><a href="#cb3-8"></a>    magnify (root <span class="op">.</span> named <span class="st">&quot;Root&quot;</span> <span class="op">./</span> named <span class="st">&quot;data&quot;</span> <span class="op">./</span> named <span class="st">&quot;record&quot;</span>) <span class="op">$</span> <span class="kw">do</span></span>
<span id="cb3-9"><a href="#cb3-9"></a>      record <span class="ot">&lt;-</span> ask</span>
<span id="cb3-10"><a href="#cb3-10"></a>      <span class="kw">let</span> name <span class="ot">=</span> record <span class="op">^?</span> (field <span class="st">&quot;Country or Area&quot;</span> <span class="op">.</span> attr <span class="st">&quot;key&quot;</span>)</span>
<span id="cb3-11"><a href="#cb3-11"></a>      <span class="kw">let</span> year <span class="ot">=</span> record <span class="op">^?</span> (field <span class="st">&quot;Year&quot;</span> <span class="op">.</span> text)</span>
<span id="cb3-12"><a href="#cb3-12"></a>      <span class="kw">let</span> val  <span class="ot">=</span> record <span class="op">^?</span> (field <span class="st">&quot;Value&quot;</span> <span class="op">.</span> text)</span>
<span id="cb3-13"><a href="#cb3-13"></a>      <span class="co">-- Returning a monoid instance (list) combines results.</span></span>
<span id="cb3-14"><a href="#cb3-14"></a>      <span class="fu">return</span> <span class="op">$</span> <span class="kw">case</span> (name, year, val) <span class="kw">of</span></span>
<span id="cb3-15"><a href="#cb3-15"></a>        (<span class="dt">Just</span> key, <span class="dt">Just</span> <span class="st">&quot;2020&quot;</span>, <span class="dt">Just</span> val) <span class="ot">-&gt;</span> [ (key, <span class="fu">read</span> <span class="op">$</span> T.unpack val) ]</span>
<span id="cb3-16"><a href="#cb3-16"></a>        _ <span class="ot">-&gt;</span> []</span></code></pre></div>
<p>Note how lenses look almost like <code>XPath</code> expressions. The <code>features</code> function just takes the original features and appends a new key:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode haskell"><code class="sourceCode haskell"><span id="cb4-1"><a href="#cb4-1"></a><span class="ot">features ::</span> <span class="dt">Map.Map</span> <span class="dt">T.Text</span> <span class="dt">Scientific</span> <span class="ot">-&gt;</span> <span class="dt">Reader</span> <span class="dt">Json.Value</span> [ <span class="dt">Json.Value</span> ]</span>
<span id="cb4-2"><a href="#cb4-2"></a>features popMap <span class="ot">=</span> <span class="kw">do</span></span>
<span id="cb4-3"><a href="#cb4-3"></a>  magnify (key <span class="st">&quot;features&quot;</span> <span class="op">.</span> values) <span class="op">$</span> <span class="kw">do</span></span>
<span id="cb4-4"><a href="#cb4-4"></a>    feature <span class="ot">&lt;-</span> ask</span>
<span id="cb4-5"><a href="#cb4-5"></a>    <span class="kw">let</span> <span class="dt">Just</span> <span class="fu">id</span> <span class="ot">=</span> feature <span class="op">^?</span> (key <span class="st">&quot;id&quot;</span> <span class="op">.</span> _String) <span class="co">-- Gross, but effective</span></span>
<span id="cb4-6"><a href="#cb4-6"></a>    <span class="fu">return</span> <span class="op">$</span> <span class="kw">case</span> (Map.lookup <span class="fu">id</span> popMap) <span class="kw">of</span></span>
<span id="cb4-7"><a href="#cb4-7"></a>      <span class="dt">Just</span> pop <span class="ot">-&gt;</span> [ feature <span class="op">&amp;</span> key <span class="st">&quot;properties&quot;</span> <span class="op">.</span> _Object <span class="op">.</span> at <span class="st">&quot;pop2020&quot;</span> <span class="op">?~</span>  <span class="dt">Json.Number</span> pop ]</span>
<span id="cb4-8"><a href="#cb4-8"></a>      _ <span class="ot">-&gt;</span> [ feature ]</span></code></pre></div>
<p>That is really all it takes to perform the transformation. Please take a look at the full listing in <a href="https://gist.github.com/7cb4fd07bc8689f5c3bccb58b2e239ae#file-etl-hs">this gist</a>. Even with the imports, it can hardly get shorter or more expressive than these fifty-something lines…</p>
<h2 id="revenge-of-the-nerds">Revenge of the Nerds</h2>
<p>So Haskell turns out to be the most practical, straightforward solution I have found for this kind of problem. Who knew?</p>
<p>I would absolutely not recommend learning Haskell just to solve this kind of problem (although I would absolutely recommend learning it for many other reasons). This is one of those occasions in which learning something just for the sake of it pays off in unexpected ways.</p>
<section class="footnotes" role="doc-endnotes">
<hr />
<ol>
<li id="fn1" role="doc-endnote"><p>No legend! No arrow pointing north! Questionable projection! This is not a post on map making, just an image to ease the reader’s eye after too much text for the internet…<a href="#fnref1" class="footnote-back" role="doc-backlink">↩︎</a></p></li>
</ol>
</section>

<div class="panel panel-default">
    <div class="panel-body">
        <div class="pull-left">
            Tags: <a href="/tags/haskell.html">haskell</a>, <a href="/tags/data.html">data</a>, <a href="/tags/xml.html">xml</a>, <a href="/tags/json.html">json</a>, <a href="/tags/geojson.html">geojson</a>
        </div>
        <div class="social pull-right">
            <span class="twitter">
                <a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2022/03/27/etl-the-haskell-way.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
            </span>

             <script src="https://apis.google.com/js/plusone.js" type="text/javascript"></script>
             <span>
                <g:plusone href="https://jarnaldich.me/blog/2022/03/27/etl-the-haskell-way.html"
  size="medium"></g:plusone>
             </span>
            
        </div>
    </div>
</div>

<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>

<div id="disqus_thread"></div>  
<script type="text/javascript">
      var disqus_shortname = 'jarnaldich';
      (function() {
          var dsq = document.createElement('script');
          dsq.type = 'text/javascript';
          dsq.async = true;
          dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
          (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
      })();
</script>
<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
    
]]></description>
    <pubDate>Sun, 27 Mar 2022 00:00:00 UT</pubDate>
    <guid>http://jarnaldich.me/blog/2022/03/27/etl-the-haskell-way.html</guid>
    <dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
    <title>Finding Curve Inflection Points in PostGIS</title>
    <link>http://jarnaldich.me/blog/2022/02/06/postgis-curve-inflection.html</link>
    <description><![CDATA[<h1>Finding Curve Inflection Points in PostGIS</h1>

<small>Posted on February  6, 2022 <a href="/blog/2022/02/06/postgis-curve-inflection.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>

<p>In this blog post I will present a way to find inflection points in a curve. An easy way to understand this: imagine the curve is a road we are driving along; we want to find the points where we stop turning right and start turning left, or vice versa, as shown below:</p>
<figure>
<img src="/images/curve_inflection.png" title="Sample of curve inflection points" class="center" width="400" alt="" /><figcaption> </figcaption>
</figure>
<p>We will show a sketch of the solution and a practical implementation with <a href="https://postgis.net">PostGIS</a>.</p>
<h2 id="a-sketch-of-the-solution">A sketch of the solution</h2>
<p>This problem can be solved with pretty standard 2d computational geometry resources. In particular, the use of the <a href="https://mathworld.wolfram.com/CrossProduct.html">cross product</a> as a way to detect if a point lies left or right of a given straight line will be useful here. The following pseudo-code is based on the determinant formula:</p>
<pre><code>function isLeft(Point a, Point b, Point c){
     return ((b.X - a.X)*(c.Y - a.Y) - (b.Y - a.Y)*(c.X - a.X)) &gt; 0;
}</code></pre>
<p>In general, I am against implementing your own computational geometry code: direct translations of mathematical formulas are often plagued with round-off errors, corner cases and blatant inefficiencies. You are better off using one of the excellent computational geometry libraries, such as <a href="https://libgeos.org">GEOS</a> (which started as a port of <a href="https://github.com/locationtech/jts">JTS</a>) or <a href="https://www.cgal.org/">CGAL</a>. Chances are you are using them anyway, since they lie at the bottom of many <a href="https://www.nationalgeographic.org/encyclopedia/geographic-information-system-gis/">GIS</a> software stacks. This holds true for any non-trivial mathematics (linear algebra, optimization…). Remember: <strong><code>floats</code> are NOT real numbers</strong>.</p>
<p>In this case, where I cared a lot more about practicality than sheer efficiency, the use of SQL’s <code>numeric</code> type, which offers arbitrary-precision arithmetic at the expense of speed, prevents some of the round-off errors we would get with <code>double precision</code>, sparing us from implementing <a href="https://www.cs.cmu.edu/~quake/robust.html">fast robust predicates</a> ourselves.</p>
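<p>To illustrate the same idea outside the database, here is the determinant predicate over exact rationals in Python (a sketch only; <code>fractions.Fraction</code> plays the role of SQL’s <code>numeric</code>, and the function name mirrors the pseudo-code above):</p>

```python
from fractions import Fraction

def is_left(a, b, c):
    # Sign of the 2D cross product (b - a) x (c - a):
    # 1 -> c lies left of the line a->b, -1 -> right, 0 -> collinear.
    d = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (d > 0) - (d < 0)

# With Fraction coordinates the arithmetic is exact, so points that are
# truly collinear yield exactly 0 -- no round-off, at the cost of speed.
a = (Fraction(0), Fraction(0))
b = (Fraction(1, 3), Fraction(1, 3))
c = (Fraction(2, 3), Fraction(2, 3))
print(is_left(a, b, c))  # 0
```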
<h2 id="postgis-implementaton">PostGIS implementaton</h2>
<p>I have long felt that Postgres/PostGIS is the nicest workbench for geospatial analysis (prove me wrong). In many use cases, being able to perform the analysis directly where your data is stored is unbeatable. Having to write an SQL script may be a drawback for some users, but it works wonders in terms of reproducibility and traceability for your data workflows.</p>
<p>In this particular case we will assume our input is a table with <code>LineString</code> geometry features, each one with its unique identifier. Of course, geometries are properly indexed and tested for validity before any calculation. It is also often useful during development to limit the calculation to a subset of the data through an area of interest in order to shorten the iteration process for testing results and parameters.</p>
<p>The sketch of the solution is:</p>
<ol type="1">
<li>Simplify the geometries to avoid noise (false positives). <code>ST_Simplify</code> or <code>ST_SimplifyPreserveTopology</code> will suffice.</li>
<li>Explode the points, keeping track of the original geometries, this can be easily done with <code>generate_series</code> and <code>ST_DumpPoints</code>.</li>
<li>We need 3 points to calculate <code>isLeft</code>: 2 to define the segment and one to test against. So, for each point along the <code>LineString</code>, we get the X,Y coordinates of the point itself and of the 2 previous points. We will check the current point’s position relative to the segment defined by the two previous points. This also means that the turning point, when detected, will be the last point of that segment, that is: the previous point. I found this calculation to be surprisingly easy with Postgres window functions.</li>
<li>Use the above points to calculate the <code>isLeft</code> measure.</li>
<li>Select the points where this measure changes sign.</li>
</ol>
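<p>The detection logic in steps 3–5 can be sketched in plain Python (an illustrative, hypothetical helper, ignoring the simplification step; the post’s actual implementation is the SQL query further down):</p>

```python
def det(a, b, c):
    # 2D cross product of (b - a) and (c - a): the isLeft determinant.
    return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

def inflection_points(points):
    # Slide a 3-point window along the line and report the middle point
    # of the window whenever the sign of the determinant flips
    # (zero determinants, i.e. collinear runs, are simply skipped).
    result, prev_sign = [], 0
    for a, b, c in zip(points, points[1:], points[2:]):
        s = det(a, b, c)
        s = (s > 0) - (s < 0)        # SIGN()
        if s != 0 and prev_sign != 0 and s != prev_sign:
            result.append(b)         # the turning point is the previous point
        if s != 0:
            prev_sign = s
    return result

zigzag = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 0)]
print(inflection_points(zigzag))  # [(2, 1)]
```

The SQL version expresses exactly this window logic with <code>LAG</code> over a partition per contour.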
<p>As usual, good code practices in general also apply to the database. In particular, <a href="https://www.postgresql.org/docs/13/queries-with.html">CTEs</a> can be used to clarify queries in the same way you would name variables or functions in whatever programming language: to enable reuse, but also to enhance readability by giving descriptive names. There is no excuse for <em>any</em> of the eye-burning SQL queries that are too often considered normal in the language.</p>
<p>Look at the sketch solution and contrast with the following implementation to see what I mean:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb2-1"><a href="#cb2-1"></a><span class="kw">WITH</span> </span>
<span id="cb2-2"><a href="#cb2-2"></a>  <span class="co">-- Optional: area of interest.</span></span>
<span id="cb2-3"><a href="#cb2-3"></a>  aoi <span class="kw">AS</span> (</span>
<span id="cb2-4"><a href="#cb2-4"></a>    <span class="kw">SELECT</span> ST_SetSRID(</span>
<span id="cb2-5"><a href="#cb2-5"></a>          ST_MakeBox2D(</span>
<span id="cb2-6"><a href="#cb2-6"></a>            ST_Point(<span class="dv">467399</span>,<span class="dv">4671999</span>),</span>
<span id="cb2-7"><a href="#cb2-7"></a>            ST_Point(<span class="dv">470200</span>,<span class="dv">4674000</span>))</span>
<span id="cb2-8"><a href="#cb2-8"></a>          ,<span class="dv">25831</span>) </span>
<span id="cb2-9"><a href="#cb2-9"></a>        <span class="kw">AS</span> geom</span>
<span id="cb2-10"><a href="#cb2-10"></a>  ),</span>
<span id="cb2-11"><a href="#cb2-11"></a>  <span class="co">-- Simplify geometries to avoid excessive noise. Tolerance is empiric and depends on application</span></span>
<span id="cb2-12"><a href="#cb2-12"></a>  simplified <span class="kw">AS</span> (</span>
<span id="cb2-13"><a href="#cb2-13"></a>    <span class="kw">SELECT</span> <span class="kw">oid</span> <span class="kw">as</span> contour_id, ST_Simplify(input_contours.geom, <span class="fl">0.2</span>) <span class="kw">AS</span> geom </span>
<span id="cb2-14"><a href="#cb2-14"></a>    <span class="kw">FROM</span> input_contours, aoi</span>
<span id="cb2-15"><a href="#cb2-15"></a>    <span class="kw">WHERE</span> input_contours.geom &amp;&amp; aoi.geom</span>
<span id="cb2-16"><a href="#cb2-16"></a>  ), </span>
<span id="cb2-17"><a href="#cb2-17"></a>  <span class="co">-- Explode points generating index and keeping track of original curve</span></span>
<span id="cb2-18"><a href="#cb2-18"></a>  points <span class="kw">AS</span> (</span>
<span id="cb2-19"><a href="#cb2-19"></a>    <span class="kw">SELECT</span> contour_id,</span>
<span id="cb2-20"><a href="#cb2-20"></a>        generate_series(<span class="dv">1</span>, st_numpoints(geom)) <span class="kw">AS</span> npoint,</span>
<span id="cb2-21"><a href="#cb2-21"></a>        (ST_DumpPoints(geom)).geom <span class="kw">AS</span> geom</span>
<span id="cb2-22"><a href="#cb2-22"></a>    <span class="kw">FROM</span> simplified</span>
<span id="cb2-23"><a href="#cb2-23"></a>  ), </span>
<span id="cb2-24"><a href="#cb2-24"></a>  <span class="co">-- Get the numeric values for X an Y of the current point </span></span>
<span id="cb2-25"><a href="#cb2-25"></a>  coords <span class="kw">AS</span> (</span>
<span id="cb2-26"><a href="#cb2-26"></a>    <span class="kw">SELECT</span> <span class="op">*</span>, st_x(geom):<span class="ch">:numeric</span> <span class="kw">AS</span> cx, st_y(geom):<span class="ch">:numeric</span> <span class="kw">AS</span> cy</span>
<span id="cb2-27"><a href="#cb2-27"></a>    <span class="kw">FROM</span> points    </span>
<span id="cb2-28"><a href="#cb2-28"></a>    <span class="kw">ORDER</span> <span class="kw">BY</span> contour_id, npoint</span>
<span id="cb2-29"><a href="#cb2-29"></a>  ),</span>
<span id="cb2-30"><a href="#cb2-30"></a>  <span class="co">-- Add the values of the 2 previous points inside the same linestring</span></span>
<span id="cb2-31"><a href="#cb2-31"></a>  <span class="co">-- LAG and PARTITION BY do all the work here.</span></span>
<span id="cb2-32"><a href="#cb2-32"></a>  segments <span class="kw">AS</span> (</span>
<span id="cb2-33"><a href="#cb2-33"></a>    <span class="kw">SELECT</span> <span class="op">*</span>, </span>
<span id="cb2-34"><a href="#cb2-34"></a>      <span class="fu">LAG</span>(geom, <span class="dv">1</span>)        <span class="kw">over</span> (<span class="kw">PARTITION</span> <span class="kw">BY</span> contour_id) <span class="kw">AS</span> prev_geom, </span>
<span id="cb2-35"><a href="#cb2-35"></a>      <span class="fu">LAG</span>(cx:<span class="ch">:numeric</span>, <span class="dv">2</span>) <span class="kw">over</span> (<span class="kw">PARTITION</span> <span class="kw">BY</span> contour_id) <span class="kw">AS</span> ax, </span>
<span id="cb2-36"><a href="#cb2-36"></a>      <span class="fu">LAG</span>(cy:<span class="ch">:numeric</span>, <span class="dv">2</span>) <span class="kw">over</span> (<span class="kw">PARTITION</span> <span class="kw">BY</span> contour_id) <span class="kw">AS</span> ay, </span>
<span id="cb2-37"><a href="#cb2-37"></a>      <span class="fu">LAG</span>(cx:<span class="ch">:numeric</span>, <span class="dv">1</span>) <span class="kw">over</span> (<span class="kw">PARTITION</span> <span class="kw">BY</span> contour_id) <span class="kw">AS</span> bx, </span>
<span id="cb2-38"><a href="#cb2-38"></a>      <span class="fu">LAG</span>(cy:<span class="ch">:numeric</span>, <span class="dv">1</span>) <span class="kw">over</span> (<span class="kw">PARTITION</span> <span class="kw">BY</span> contour_id) <span class="kw">AS</span> <span class="kw">by</span></span>
<span id="cb2-39"><a href="#cb2-39"></a>    <span class="kw">FROM</span> coords</span>
<span id="cb2-40"><a href="#cb2-40"></a>    <span class="kw">ORDER</span> <span class="kw">BY</span> contour_id, npoint</span>
<span id="cb2-41"><a href="#cb2-41"></a>  ),</span>
<span id="cb2-42"><a href="#cb2-42"></a>  det <span class="kw">AS</span> (</span>
<span id="cb2-43"><a href="#cb2-43"></a>    <span class="kw">SELECT</span> <span class="op">*</span>, </span>
<span id="cb2-44"><a href="#cb2-44"></a>      (((bx<span class="op">-</span>ax)<span class="op">*</span>(cy<span class="op">-</span>ay)) <span class="op">-</span> ((<span class="kw">by</span><span class="op">-</span>ay)<span class="op">*</span>(cx<span class="op">-</span>ax))) <span class="kw">AS</span> det <span class="co">-- cross product in 2d</span></span>
<span id="cb2-45"><a href="#cb2-45"></a>    <span class="kw">FROM</span> segments</span>
<span id="cb2-46"><a href="#cb2-46"></a>  ),</span>
<span id="cb2-47"><a href="#cb2-47"></a>  <span class="co">-- Uses the SIGN multiplication as a proxy for XOR (change in convexity) </span></span>
<span id="cb2-48"><a href="#cb2-48"></a>  convexity <span class="kw">AS</span> (</span>
<span id="cb2-49"><a href="#cb2-49"></a>    <span class="kw">SELECT</span> <span class="op">*</span>, </span>
<span id="cb2-50"><a href="#cb2-50"></a>      <span class="fu">SIGN</span>(det) <span class="op">*</span> <span class="fu">SIGN</span>(<span class="fu">lag</span>(det, <span class="dv">1</span>) <span class="kw">OVER</span> (<span class="kw">PARTITION</span> <span class="kw">BY</span> contour_id)) <span class="kw">AS</span> <span class="kw">change</span></span>
<span id="cb2-51"><a href="#cb2-51"></a>    <span class="kw">FROM</span> det</span>
<span id="cb2-52"><a href="#cb2-52"></a>  )</span>
<span id="cb2-53"><a href="#cb2-53"></a><span class="kw">SELECT</span> contour_id, npoint, prev_geom <span class="kw">AS</span> geom</span>
<span id="cb2-54"><a href="#cb2-54"></a><span class="kw">FROM</span> convexity</span>
<span id="cb2-55"><a href="#cb2-55"></a><span class="kw">WHERE</span> <span class="kw">change</span> <span class="op">=</span> <span class="op">-</span><span class="dv">1</span></span>
<span id="cb2-56"><a href="#cb2-56"></a><span class="kw">ORDER</span> <span class="kw">BY</span> contour_id, npoint</span></code></pre></div>
<p>Here’s what the results look like for a sample area:</p>
<figure>
<img src="/images/curve_inflection_2.png" title="Sample of curve inflection points results" class="center" alt="" /><figcaption> </figcaption>
</figure>

<div class="panel panel-default">
    <div class="panel-body">
        <div class="pull-left">
            Tags: <a href="/tags/postgres.html">postgres</a>, <a href="/tags/postgis.html">postgis</a>, <a href="/tags/curve.html">curve</a>, <a href="/tags/inflection.html">inflection</a>, <a href="/tags/GIS.html">GIS</a>
        </div>
        <div class="social pull-right">
            <span class="twitter">
                <a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2022/02/06/postgis-curve-inflection.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
            </span>

            
        </div>
    </div>
</div>

<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>

<div id="disqus_thread"></div>  
<script type="text/javascript">
      var disqus_shortname = 'jarnaldich';
      (function() {
          var dsq = document.createElement('script');
          dsq.type = 'text/javascript';
          dsq.async = true;
          dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
          (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
      })();
</script>
<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
    
]]></description>
    <pubDate>Sun, 06 Feb 2022 00:00:00 UT</pubDate>
    <guid>http://jarnaldich.me/blog/2022/02/06/postgis-curve-inflection.html</guid>
    <dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
    <title>Introspection in PostgreSQL</title>
    <link>http://jarnaldich.me/blog/2021/08/30/postgres-introspection.html</link>
    <description><![CDATA[<h1>Introspection in PostgreSQL</h1>

<small>Posted on August 30, 2021 <a href="/blog/2021/08/30/postgres-introspection.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>

<p> </p>
<figure>
<img src="/images/introspection.png" title="Detail of Alexander Stirling Calder Introspection (c. 1935)" class="wrap" alt="" /><figcaption> </figcaption>
</figure>
<p>In coding, introspection refers to the ability of some systems to query and expose information on their own structure. Typical examples are being able to query an object’s methods or properties (eg. Python’s <code>__dict__</code>).</p>
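<p>A minimal Python illustration of that kind of runtime introspection (the <code>Point</code> class here is made up for the example):</p>

```python
class Point:
    """A tiny class used only to illustrate introspection."""
    def __init__(self, x, y):
        self.x = x
        self.y = y

p = Point(1, 2)

# __dict__ exposes an object's instance attributes at runtime...
print(p.__dict__)            # {'x': 1, 'y': 2}
# ...and dir() lists its attributes and methods.
print('__init__' in dir(p))  # True
```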
<p>In a DB system, it typically refers to the mechanism by which schema information regarding tables, attributes, foreign keys, indices, data types, etc… can be programmatically queried.</p>
<p>This is useful in many ways, eg:</p>
<ul>
<li>Code reuse: making code that can be made schema-agnostic. For example, <a href="https://github.com/adrianandrei-ca/pgunit">pgunit</a>, a NUnit-style testing framework for postgresql, automatically searches for functions whose name start with <code>test_</code>.</li>
<li>Discovering the structure of ill-documented or legacy databases.</li>
</ul>
<p>In this article we will explore some options for making use of the introspection capabilities of PostgreSQL.</p>
<h2 id="information-schema-vs-system-catalogs">Information schema vs system catalogs</h2>
<p>There are two main devices to query information about the objects defined in a Postgres database. The first one is the information schema, which is defined in the SQL standard and thus expected to be portable and remain stable, but cannot provide information about Postgres-specific features. As with many aspects of the SQL standard, there are vendor-specific issues (most notably, Oracle does not implement it out of the box). If you are using introspection as part of a library and do not need Postgres-specific information, this approach gives you a better chance of future compatibility across RDBMSs and even PostgreSQL versions.</p>
<p>The other approach involves querying the so-called <a href="https://www.postgresql.org/docs/13/catalogs.html">System Catalogs</a>. These are tables belonging to the <code>pg_catalog</code> schema. For example, the <code>pg_catalog.pg_class</code> (pseudo-)table catalogs tables and almost everything else that has columns or is otherwise similar to a table (views, materialized or not…). This approach is version-dependent, but I would be surprised to see major changes in the near future.</p>
<p>This is the approach we will be focusing on in this article, because the tooling and coding ergonomics from PostgreSQL are more convenient, as you will see in the next sections.</p>
<h2 id="use-the-command-line-luke">Use the command-line, Luke</h2>
<p>The <code>psql</code> command-line client is a very powerful and often overlooked utility (as are many other command-line tools). Typing <code>\?</code> after connecting will show a plethora of commands that let you inspect the DB. What most people do not know, though, is that these commands are implemented as regular SQL queries against the system catalogs and that <strong>you can actually see the code</strong> just by invoking the <code>psql</code> client with the <code>-E</code> option. For example:</p>
<pre><code>PGPASSWORD=&lt;password&gt; psql -E -U &lt;user&gt; -h &lt;host&gt; &lt;db&gt;</code></pre>
<p>And then typing for the description of the <code>pg_catalog.pg_class</code> table itself:</p>
<pre><code>\dt+ pg_catalog.pg_class</code></pre>
<p>yields:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb3-1"><a href="#cb3-1"></a><span class="op">*********</span> <span class="kw">QUERY</span> <span class="op">**********</span></span>
<span id="cb3-2"><a href="#cb3-2"></a><span class="kw">SELECT</span> n.nspname <span class="kw">as</span> <span class="ot">&quot;Schema&quot;</span>,</span>
<span id="cb3-3"><a href="#cb3-3"></a>  c.relname <span class="kw">as</span> <span class="ot">&quot;Name&quot;</span>,</span>
<span id="cb3-4"><a href="#cb3-4"></a>  <span class="cf">CASE</span> c.relkind </span>
<span id="cb3-5"><a href="#cb3-5"></a>    <span class="cf">WHEN</span> <span class="st">&#39;r&#39;</span> <span class="cf">THEN</span> <span class="st">&#39;table&#39;</span> </span>
<span id="cb3-6"><a href="#cb3-6"></a>    <span class="cf">WHEN</span> <span class="st">&#39;v&#39;</span> <span class="cf">THEN</span> <span class="st">&#39;view&#39;</span> </span>
<span id="cb3-7"><a href="#cb3-7"></a>    <span class="cf">WHEN</span> <span class="st">&#39;m&#39;</span> <span class="cf">THEN</span> <span class="st">&#39;materialized view&#39;</span> </span>
<span id="cb3-8"><a href="#cb3-8"></a>    <span class="cf">WHEN</span> <span class="st">&#39;i&#39;</span> <span class="cf">THEN</span> <span class="st">&#39;index&#39;</span> </span>
<span id="cb3-9"><a href="#cb3-9"></a>    <span class="cf">WHEN</span> <span class="st">&#39;S&#39;</span> <span class="cf">THEN</span> <span class="st">&#39;sequence&#39;</span> </span>
<span id="cb3-10"><a href="#cb3-10"></a>    <span class="cf">WHEN</span> <span class="st">&#39;s&#39;</span> <span class="cf">THEN</span> <span class="st">&#39;special&#39;</span> </span>
<span id="cb3-11"><a href="#cb3-11"></a>    <span class="cf">WHEN</span> <span class="st">&#39;f&#39;</span> <span class="cf">THEN</span> <span class="st">&#39;foreign table&#39;</span> </span>
<span id="cb3-12"><a href="#cb3-12"></a>    <span class="cf">WHEN</span> <span class="st">&#39;p&#39;</span> <span class="cf">THEN</span> <span class="st">&#39;partitioned table&#39;</span> </span>
<span id="cb3-13"><a href="#cb3-13"></a>    <span class="cf">WHEN</span> <span class="st">&#39;I&#39;</span> <span class="cf">THEN</span> <span class="st">&#39;partitioned index&#39;</span> </span>
<span id="cb3-14"><a href="#cb3-14"></a>  <span class="cf">END</span> <span class="kw">as</span> <span class="ot">&quot;Type&quot;</span>,</span>
<span id="cb3-15"><a href="#cb3-15"></a>  pg_catalog.pg_get_userbyid(c.relowner) <span class="kw">as</span> <span class="ot">&quot;Owner&quot;</span>,</span>
<span id="cb3-16"><a href="#cb3-16"></a>  pg_catalog.pg_size_pretty(pg_catalog.pg_table_size(c.<span class="kw">oid</span>)) <span class="kw">as</span> <span class="ot">&quot;Size&quot;</span>,</span>
<span id="cb3-17"><a href="#cb3-17"></a>  pg_catalog.obj_description(c.<span class="kw">oid</span>, <span class="st">&#39;pg_class&#39;</span>) <span class="kw">as</span> <span class="ot">&quot;Description&quot;</span></span>
<span id="cb3-18"><a href="#cb3-18"></a><span class="kw">FROM</span> pg_catalog.pg_class c</span>
<span id="cb3-19"><a href="#cb3-19"></a>     <span class="kw">LEFT</span> <span class="kw">JOIN</span> pg_catalog.pg_namespace n <span class="kw">ON</span> n.<span class="kw">oid</span> <span class="op">=</span> c.relnamespace</span>
<span id="cb3-20"><a href="#cb3-20"></a><span class="kw">WHERE</span> c.relkind <span class="kw">IN</span> (<span class="st">&#39;r&#39;</span>,<span class="st">&#39;p&#39;</span>,<span class="st">&#39;s&#39;</span>,<span class="st">&#39;&#39;</span>)</span>
<span id="cb3-21"><a href="#cb3-21"></a>      <span class="kw">AND</span> n.nspname !~ <span class="st">&#39;^pg_toast&#39;</span></span>
<span id="cb3-22"><a href="#cb3-22"></a>  <span class="kw">AND</span> c.relname <span class="kw">OPERATOR</span>(pg_catalog.~) <span class="st">&#39;^(pg_class)$&#39;</span></span>
<span id="cb3-23"><a href="#cb3-23"></a>  <span class="kw">AND</span> n.nspname <span class="kw">OPERATOR</span>(pg_catalog.~) <span class="st">&#39;^(pg_catalog)$&#39;</span></span>
<span id="cb3-24"><a href="#cb3-24"></a><span class="kw">ORDER</span> <span class="kw">BY</span> <span class="dv">1</span>,<span class="dv">2</span>;</span>
<span id="cb3-25"><a href="#cb3-25"></a><span class="op">**************************</span></span>
<span id="cb3-26"><a href="#cb3-26"></a></span>
<span id="cb3-27"><a href="#cb3-27"></a>                        <span class="kw">List</span> <span class="kw">of</span> relations</span>
<span id="cb3-28"><a href="#cb3-28"></a>   <span class="kw">Schema</span>   |   Name   | <span class="kw">Type</span>  |  Owner   |  <span class="kw">Size</span>  | Description</span>
<span id="cb3-29"><a href="#cb3-29"></a><span class="co">------------|----------|-------|----------|--------|-------------</span></span>
<span id="cb3-30"><a href="#cb3-30"></a> pg_catalog | pg_class | <span class="kw">table</span> | postgres | <span class="dv">136</span> kB |</span>
<span id="cb3-31"><a href="#cb3-31"></a>(<span class="dv">1</span> <span class="kw">row</span>)</span></code></pre></div>
<p>This gives you a quite descriptive (and corner-case-complete) template to start your own code from. For example, in the query above we could replace the <code>^(pg_class)$</code> regex with some other pattern. Bear in mind that this trick is only helpful with the system catalog approach.</p>
<h2 id="regclasses-and-oids">Regclasses and OIDs</h2>
<p>Many objects in the system catalogs have some sort of “unique id” in the form of an <code>oid</code> attribute. It is sometimes convenient to know that you can turn descriptive names into such <code>oid</code>s by casting into the <code>regclass</code> data type.</p>
<p>For example, in a somewhat circular turn of events, the attributes of the catalog table storing attribute information can be queried by name as:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb4-1"><a href="#cb4-1"></a><span class="kw">SELECT</span> attnum, attname, format_type(atttypid, atttypmod) <span class="kw">as</span> <span class="ot">&quot;Type&quot;</span> </span>
<span id="cb4-2"><a href="#cb4-2"></a><span class="kw">FROM</span> pg_attribute </span>
<span id="cb4-3"><a href="#cb4-3"></a><span class="kw">WHERE</span> attrelid <span class="op">=</span> <span class="st">&#39;pg_attribute&#39;</span>:<span class="ch">:regclass</span> </span>
<span id="cb4-4"><a href="#cb4-4"></a>  <span class="kw">AND</span> attnum <span class="op">&gt;</span> <span class="dv">0</span> </span>
<span id="cb4-5"><a href="#cb4-5"></a>  <span class="kw">AND</span> <span class="kw">NOT</span> attisdropped <span class="kw">ORDER</span> <span class="kw">BY</span> attnum;</span></code></pre></div>
<p>In the result of that query, we can see that attrelid should be an <code>oid</code>:</p>
<pre><code>attnum     |   attname     | Type
-----------|---------------|-----------
         1 | attrelid      | oid
         2 | attname       | name
         ...
        20 | attoptions    | text[]
        21 | attfdwoptions | text[]</code></pre>
<p>Without the <code>regclass</code> cast, querying by name would mean joining with <code>pg_class</code> and filtering on the table name. There are other types that will get you an OID from a string description for other objects (<code>regprocedure</code> for procedures, <code>regtype</code> for types, …).</p>
<h2 id="system-catalog-information-functions">System Catalog Information Functions</h2>
<p>Another interesting utility for the <code>pg_catalog</code> approach is the ability to translate definitions into SQL DDL. We saw one of them (<code>format_type</code>) in the previous example, but there are many of them (constraints, function source code …).</p>
<p>Just refer to the <a href="https://www.postgresql.org/docs/13/functions-info.html#FUNCTIONS-INFO-CATALOG-TABLE">section in the manual</a> for more.</p>
<h2 id="inspecting-arbitrary-queries">Inspecting arbitrary queries</h2>
<p>As a side note, it is worth knowing that we can inspect the data types of any provided query by pretending to turn it into a temporary table. This comes in handy for user-provided queries in external tools (injection caveats apply):</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb6-1"><a href="#cb6-1"></a><span class="kw">CREATE</span> TEMP <span class="kw">TABLE</span> tmp <span class="kw">AS</span> <span class="kw">SELECT</span> <span class="dv">1</span>:<span class="ch">:numeric</span>, now() <span class="kw">LIMIT</span> <span class="dv">0</span>;</span></code></pre></div>
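<p>The same idea can be sketched with Python’s standard-library <code>sqlite3</code> module (SQLite here, not PostgreSQL — the mechanics carry over, the type system does not):</p>

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Materialize the (empty) shape of an arbitrary query as a temp table...
con.execute(
    "CREATE TEMP TABLE tmp AS SELECT 1 AS n, datetime('now') AS ts LIMIT 0"
)
# ...then introspect its columns without ever running the full query.
columns = [row[1] for row in con.execute("PRAGMA table_info(tmp)")]
print(columns)  # ['n', 'ts']
```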
<h2 id="wrapping-up">Wrapping up</h2>
<p>As usual, <strong>good SW practices apply to DB code, too</strong>: it is easy to isolate any incompatible code by defining a clear interface in your library. Instead of querying the catalog everywhere, define a set of views or functions that expose the introspection information to the rest of your code and act as an API. Any future change in the system catalogs will then not propagate beyond those specific views. For example, if your application needs to know about tables and attribute data types, instead of querying the catalog from many places, define a view that works as an interface between the system catalogs and your code:</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb7-1"><a href="#cb7-1"></a><span class="kw">CREATE</span> <span class="kw">OR</span> <span class="kw">REPLACE</span> <span class="kw">VIEW</span> table_columns <span class="kw">AS</span></span>
<span id="cb7-2"><a href="#cb7-2"></a><span class="kw">WITH</span> table_oids <span class="kw">AS</span> (</span>
<span id="cb7-3"><a href="#cb7-3"></a>      <span class="kw">SELECT</span> c.relname, c.<span class="kw">oid</span></span>
<span id="cb7-4"><a href="#cb7-4"></a>      <span class="kw">FROM</span> pg_catalog.pg_class c</span>
<span id="cb7-5"><a href="#cb7-5"></a>        <span class="kw">LEFT</span> <span class="kw">JOIN</span> pg_catalog.pg_namespace n <span class="kw">ON</span> n.<span class="kw">oid</span> <span class="op">=</span> c.relnamespace</span>
<span id="cb7-6"><a href="#cb7-6"></a>      <span class="kw">WHERE</span> </span>
<span id="cb7-7"><a href="#cb7-7"></a>        pg_catalog.pg_table_is_visible(c.<span class="kw">oid</span>) <span class="kw">AND</span> relkind <span class="op">=</span> <span class="st">&#39;r&#39;</span>),</span>
<span id="cb7-8"><a href="#cb7-8"></a>    column_types <span class="kw">AS</span> (</span>
<span id="cb7-9"><a href="#cb7-9"></a>      <span class="kw">SELECT</span></span>
<span id="cb7-10"><a href="#cb7-10"></a>        toids.relname <span class="kw">AS</span> <span class="ot">&quot;tablename&quot;</span>, </span>
<span id="cb7-11"><a href="#cb7-11"></a>        a.attname <span class="kw">as</span> <span class="ot">&quot;column&quot;</span>,</span>
<span id="cb7-12"><a href="#cb7-12"></a>        pg_catalog.format_type(a.atttypid, a.atttypmod) <span class="kw">as</span> <span class="ot">&quot;datatype&quot;</span></span>
<span id="cb7-13"><a href="#cb7-13"></a>      <span class="kw">FROM</span></span>
<span id="cb7-14"><a href="#cb7-14"></a>        pg_catalog.pg_attribute a, table_oids toids</span>
<span id="cb7-15"><a href="#cb7-15"></a>      <span class="kw">WHERE</span></span>
<span id="cb7-16"><a href="#cb7-16"></a>        a.attnum <span class="op">&gt;</span> <span class="dv">0</span></span>
<span id="cb7-17"><a href="#cb7-17"></a>        <span class="kw">AND</span> <span class="kw">NOT</span> a.attisdropped</span>
<span id="cb7-18"><a href="#cb7-18"></a>        <span class="kw">AND</span> a.attrelid <span class="op">=</span> toids.<span class="kw">oid</span>)</span>
<span id="cb7-19"><a href="#cb7-19"></a><span class="kw">SELECT</span> <span class="op">*</span> <span class="kw">FROM</span> column_types;</span></code></pre></div>
<p>I will be assembling some such utility views I find useful in the future in <a href="https://gist.github.com/jarnaldich/d5952a134d89dfac48d034ed141e86c5">this gist</a>.</p>
<p><strong>UPDATE Dec. 15th 2022:</strong> For any real use case, check <em>syonfox</em>’s solution (see comments), documented <a href="https://gist.github.com/jarnaldich/d5952a134d89dfac48d034ed141e86c5?permalink_comment_id=4401600">here</a>. It is way more powerful than my solution above, which I leave here just to keep this article simple.</p>

<div class="panel panel-default">
    <div class="panel-body">
        <div class="pull-left">
            Tags: <a href="/tags/postgres.html">postgres</a>, <a href="/tags/introspection.html">introspection</a>, <a href="/tags/database.html">database</a>
        </div>
        <div class="social pull-right">
            <span class="twitter">
                <a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2021/08/30/postgres-introspection.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
            </span>

            
        </div>
    </div>
</div>

<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>

<div id="disqus_thread"></div>  
<script type="text/javascript">
      var disqus_shortname = 'jarnaldich';
      (function() {
          var dsq = document.createElement('script');
          dsq.type = 'text/javascript';
          dsq.async = true;
          dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
          (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
      })();
</script>
<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
    
]]></description>
    <pubDate>Mon, 30 Aug 2021 00:00:00 UT</pubDate>
    <guid>http://jarnaldich.me/blog/2021/08/30/postgres-introspection.html</guid>
    <dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
    <title>Optimizing Geospatial Workloads</title>
    <link>http://jarnaldich.me/blog/2020/02/29/optimizing-geospatial-workloads.html</link>
    <description><![CDATA[<h1>Optimizing Geospatial Workloads</h1>

<small>Posted on February 29, 2020 <a href="/blog/2020/02/29/optimizing-geospatial-workloads.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>

<p>Large area geospatial processing often involves splitting into smaller working tiles to be processed or downloaded independently. As an example, 25cm resolution orthophoto production in Catalonia is divided into 4275 rectangular tiles, as seen in the following image.</p>
<figure>
<img src="/images/tiles5k.png" title="Orthophoto Tiling" class="center" alt="" /><figcaption> </figcaption>
</figure>
<p>Whenever a process can be applied to those tiles independently (ie, not depending on their neighborhood), parallel processing is an easy way to increase the throughput. In such environments, the total workload has to be distributed among a fixed, often limited, number of processing units (be they cores or computers). If the scheduling mechanism requires a predefined batch to be assigned to each core (or if there is no scheduling mechanism at all), and when the processing units are of similar processing power, then the maximum speedup is attained when all batches have an equal amount of tiles.</p>
<p>Furthermore, since the result often has to be mosaicked in order to inspect it, or to aggregate it into a larger final product, it is desirable for the different batches to keep spatial continuity, ideally forming axis-parallel rectangles, since that is the basic form of georeferencing for projected geospatial imagery.</p>
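<p>Setting the spatial-contiguity requirement aside for a moment, perfectly balanced batch sizes are a one-liner; a Python sketch (the function name is mine):</p>

```python
def batch_sizes(ntiles, nworkers):
    """Split ntiles into nworkers batches whose sizes differ by at most one."""
    q, r = divmod(ntiles, nworkers)
    return [q + 1] * r + [q] * (nworkers - r)

sizes = batch_sizes(4275, 8)
print(sizes)       # [535, 535, 535, 534, 534, 534, 534, 534]
print(sum(sizes))  # 4275
```

It is the requirement that each batch also be a contiguous rectangle of tiles that turns this into a real optimization problem.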
<h2 id="the-problem">The problem</h2>
<p>This is a discrete optimization problem, which can be solved using the usual machinery. Since I have been dusting off my <a href="https://www.minizinc.org">MiniZinc</a> skills through Coursera’s discrete optimization series, I decided to give it a go.</p>
<h3 id="tile-scheme-representation">Tile scheme representation</h3>
<p>For convenience, the list of valid tiles can be read from an external <code>.dzn</code> data file.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode minizinc"><code class="sourceCode minizinc"><span id="cb1-1"><a href="#cb1-1"></a>ntiles = <span class="fl">4275</span>;</span>
<span id="cb1-2"><a href="#cb1-2"></a>Tiles  = [| <span class="fl">253</span>, <span class="fl">055</span></span>
<span id="cb1-3"><a href="#cb1-3"></a>          | <span class="fl">254</span>, <span class="fl">055</span></span>
<span id="cb1-4"><a href="#cb1-4"></a>          | <span class="fl">253</span>, <span class="fl">056</span></span>
<span id="cb1-5"><a href="#cb1-5"></a>          | <span class="fl">254</span>, <span class="fl">056</span></span>
<span id="cb1-6"><a href="#cb1-6"></a>          | <span class="fl">255</span>, <span class="fl">055</span></span>
<span id="cb1-7"><a href="#cb1-7"></a>          | <span class="fl">255</span>, <span class="fl">056</span></span>
<span id="cb1-8"><a href="#cb1-8"></a>          | <span class="fl">256</span>, <span class="fl">056</span></span>
<span id="cb1-9"><a href="#cb1-9"></a>          | <span class="fl">257</span>, <span class="fl">056</span></span>
<span id="cb1-10"><a href="#cb1-10"></a>          | <span class="fl">252</span>, <span class="fl">059</span></span>
<span id="cb1-11"><a href="#cb1-11"></a>          …</span>
<span id="cb1-12"><a href="#cb1-12"></a>          |];</span></code></pre></div>
<p>The above basically declares the list of valid tiles as a 2d array with <code>ntiles</code> rows and 2 columns. Then, in our model file (<code>.mzn</code>) the data will be loaded into the <code>Tiles</code> constant array, declared as follows:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode minizinc"><code class="sourceCode minizinc"><span id="cb2-1"><a href="#cb2-1"></a><span class="kw">int</span>: ntiles;</span>
<span id="cb2-2"><a href="#cb2-2"></a></span>
<span id="cb2-3"><a href="#cb2-3"></a><span class="kw">enum</span> e_tile = { ocol, orow };</span>
<span id="cb2-4"><a href="#cb2-4"></a><span class="kw">array</span>[<span class="fl">1</span>..ntiles, e_tile ] <span class="kw">of</span> <span class="kw">int</span>: Tiles;</span></code></pre></div>
<p>Notice the use of a column enum to make access easier.</p>
<p>From the above data, a 2d grid can be built within the bounds of minimum and maximum columns, where the grid value is <code>true</code> if there exists a tile in that position, and <code>false</code> otherwise. This builds a nice representation for modelling the spatial restrictions in the problem.</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode minizinc"><code class="sourceCode minizinc"><span id="cb3-1"><a href="#cb3-1"></a><span class="kw">int</span>: mincol = <span class="kw">min</span>([ Tiles[i, ocol] | i <span class="kw">in</span> <span class="fl">1</span>..ntiles ]);</span>
<span id="cb3-2"><a href="#cb3-2"></a><span class="kw">int</span>: maxcol = <span class="kw">max</span>([ Tiles[i, ocol] | i <span class="kw">in</span> <span class="fl">1</span>..ntiles ]);</span>
<span id="cb3-3"><a href="#cb3-3"></a><span class="kw">int</span>: minrow = <span class="kw">min</span>([ Tiles[i, orow] | i <span class="kw">in</span> <span class="fl">1</span>..ntiles ]);</span>
<span id="cb3-4"><a href="#cb3-4"></a><span class="kw">int</span>: maxrow = <span class="kw">max</span>([ Tiles[i, orow] | i <span class="kw">in</span> <span class="fl">1</span>..ntiles ]);</span>
<span id="cb3-5"><a href="#cb3-5"></a></span>
<span id="cb3-6"><a href="#cb3-6"></a><span class="kw">array</span>[minrow..maxrow, mincol..maxcol] <span class="kw">of</span> <span class="kw">int</span>: Grid =</span>
<span id="cb3-7"><a href="#cb3-7"></a>  <span class="kw">array2d</span>(minrow..maxrow, mincol..maxcol,</span>
<span id="cb3-8"><a href="#cb3-8"></a>     [ <span class="cf">exists</span>(i <span class="kw">in</span> <span class="fl">1</span>..ntiles)(Tiles[i, orow] == r /\ Tiles[i, ocol] == c)</span>
<span id="cb3-9"><a href="#cb3-9"></a>       | r <span class="kw">in</span> minrow..maxrow, c <span class="kw">in</span> mincol..maxcol ]);</span></code></pre></div>
<p>Note that all this is computed at compile time, before the actual optimization begins.</p>
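<p>The same <code>Grid</code> construction can be sketched in Python for clarity (tile list abbreviated to the first few rows of the example data):</p>

```python
def build_grid(tiles):
    """Map every (row, col) cell in the bounding box to True if a tile exists there.

    tiles is a list of (col, row) pairs, matching the .dzn column order."""
    mincol = min(c for c, r in tiles)
    maxcol = max(c for c, r in tiles)
    minrow = min(r for c, r in tiles)
    maxrow = max(r for c, r in tiles)
    present = {(r, c) for c, r in tiles}
    return {(r, c): (r, c) in present
            for r in range(minrow, maxrow + 1)
            for c in range(mincol, maxcol + 1)}

grid = build_grid([(253, 55), (254, 55), (253, 56), (254, 56), (255, 55)])
print(grid[(55, 253)])  # True  (tile exists)
print(grid[(56, 255)])  # False (hole in the grid)
```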
<h3 id="box-representation">Box representation</h3>
<p>Boxes are rectangles defined by their top, left, bottom and right bounds:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode minizinc"><code class="sourceCode minizinc"><span id="cb4-1"><a href="#cb4-1"></a><span class="kw">int</span>: nboxes;</span>
<span id="cb4-2"><a href="#cb4-2"></a><span class="kw">enum</span> e_bbox  = { top, left, bottom, right };</span>
<span id="cb4-3"><a href="#cb4-3"></a><span class="kw">array</span>[<span class="fl">1</span>..nboxes, e_bbox] <span class="kw">of</span> <span class="kw">var</span> <span class="kw">int</span>: Boxes;</span></code></pre></div>
<p>Grid positions increase as in a matrix (rows grow downwards from the top, columns rightwards from the left), and box bounds are constrained within the tile grid limits. Limits are inclusive. These requirements can be expressed as a MiniZinc <code>constraint</code>:</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode minizinc"><code class="sourceCode minizinc"><span id="cb5-1"><a href="#cb5-1"></a><span class="kw">constraint</span></span>
<span id="cb5-2"><a href="#cb5-2"></a>  forall(b <span class="kw">in</span> <span class="fl">1</span>..nboxes) (</span>
<span id="cb5-3"><a href="#cb5-3"></a>      mincol &lt;= Boxes[b, left] /\ Boxes[b, left]  &lt;= maxcol /\</span>
<span id="cb5-4"><a href="#cb5-4"></a>      minrow &lt;= Boxes[b, top] /\ Boxes[b, top] &lt;= maxrow /\</span>
<span id="cb5-5"><a href="#cb5-5"></a>      Boxes[b, left] &lt;= Boxes[b, right] /\</span>
<span id="cb5-6"><a href="#cb5-6"></a>      Boxes[b, top] &lt;= Boxes[b, bottom]);</span></code></pre></div>
<p>Each tile belongs to just one box, so boxes do not overlap.</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode minizinc"><code class="sourceCode minizinc"><span id="cb6-1"><a href="#cb6-1"></a><span class="kw">predicate</span> no_overlap(<span class="kw">var</span> <span class="kw">int</span>:l1, <span class="kw">var</span> <span class="kw">int</span>:t1, <span class="kw">var</span> <span class="kw">int</span>:b1, <span class="kw">var</span> <span class="kw">int</span>:r1,</span>
<span id="cb6-2"><a href="#cb6-2"></a>                     <span class="kw">var</span> <span class="kw">int</span>:l2, <span class="kw">var</span> <span class="kw">int</span>:t2, <span class="kw">var</span> <span class="kw">int</span>:b2, <span class="kw">var</span> <span class="kw">int</span>:r2) =</span>
<span id="cb6-3"><a href="#cb6-3"></a>   r1 &lt; l2 \/ l1 &gt; r2 \/ b1 &lt; t2 \/ t1 &gt; b2 \/</span>
<span id="cb6-4"><a href="#cb6-4"></a>   r2 &lt; l1 \/ l2 &gt; r1 \/ b2 &lt; t1 \/ t2 &gt; b1;</span>
<span id="cb6-5"><a href="#cb6-5"></a></span>
<span id="cb6-6"><a href="#cb6-6"></a><span class="kw">constraint</span> </span>
<span id="cb6-7"><a href="#cb6-7"></a>forall(b1,b2 <span class="kw">in</span> <span class="fl">1</span>..nboxes where b1 &lt; b2) (</span>
<span id="cb6-8"><a href="#cb6-8"></a>    no_overlap(</span>
<span id="cb6-9"><a href="#cb6-9"></a>     	Boxes[b1, left], Boxes[b1, top], Boxes[b1, bottom], Boxes[b1, right],</span>
<span id="cb6-10"><a href="#cb6-10"></a>	    Boxes[b2, left], Boxes[b2, top], Boxes[b2, bottom], Boxes[b2, right]));</span></code></pre></div>
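<p>The same predicate, sketched in Python with a quick check. Note that the second row of disjuncts in the MiniZinc version just restates the first with the two boxes swapped (<code>r2 &lt; l1</code> is <code>l1 &gt; r2</code>, and so on), so one direction suffices:</p>

```python
def no_overlap(l1, t1, b1, r1, l2, t2, b2, r2):
    """True when two inclusive boxes share no grid cell."""
    return r1 < l2 or l1 > r2 or b1 < t2 or t1 > b2

# Two side-by-side boxes do not overlap...
print(no_overlap(0, 0, 4, 4, 5, 0, 4, 9))  # True
# ...but shifting the second one left by one column makes them touch.
print(no_overlap(0, 0, 4, 4, 4, 0, 4, 9))  # False
```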
<h3 id="assignment">Assignment</h3>
<p>In the end we want an array relating every tile to its box. Since we chose to represent a tile by its row and column, this can be modeled as a 2d array over the grid with values in <code>0..nboxes</code>. We reserve the special value 0 for the empty tiles within the grid.</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode minizinc"><code class="sourceCode minizinc"><span id="cb7-1"><a href="#cb7-1"></a><span class="kw">array</span>[minrow..maxrow, mincol..maxcol] <span class="kw">of</span> <span class="kw">var</span> <span class="fl">0</span>..nboxes: Assignment;</span></code></pre></div>
<p>The rules that relate the tile Grid with the Boxes and Assignment vector can be enumerated as follows:</p>
<ol type="1">
<li>Every tile inside the range of a box is assigned to it.</li>
<li>Tiles not present are not assigned.</li>
<li>Tiles not assigned to a box, but present, are assigned to another box.</li>
</ol>
<div class="sourceCode" id="cb8"><pre class="sourceCode minizinc"><code class="sourceCode minizinc"><span id="cb8-1"><a href="#cb8-1"></a><span class="kw">constraint</span></span>
<span id="cb8-2"><a href="#cb8-2"></a>  forall(b <span class="kw">in</span> <span class="fl">1</span>..nboxes) (</span>
<span id="cb8-3"><a href="#cb8-3"></a>      forall(r <span class="kw">in</span> minrow..maxrow) (</span>
<span id="cb8-4"><a href="#cb8-4"></a>          forall(c <span class="kw">in</span> mincol..maxcol) (</span>
<span id="cb8-5"><a href="#cb8-5"></a>            if Grid[r,c] &gt; <span class="fl">0</span> then</span>
<span id="cb8-6"><a href="#cb8-6"></a>              if contains(Boxes[b, left], Boxes[b, top],</span>
<span id="cb8-7"><a href="#cb8-7"></a>                          Boxes[b, bottom], Boxes[b, right],</span>
<span id="cb8-8"><a href="#cb8-8"></a>                          r, c)</span>
<span id="cb8-9"><a href="#cb8-9"></a>              then</span>
<span id="cb8-10"><a href="#cb8-10"></a>                <span class="co">% 1 - Tiles within the range of a box are assigned to it</span></span>
<span id="cb8-11"><a href="#cb8-11"></a>                Assignment[r,c] = b</span>
<span id="cb8-12"><a href="#cb8-12"></a>              else</span>
<span id="cb8-13"><a href="#cb8-13"></a>                <span class="co">% 3 - Tiles not assigned to a box are assigned to another</span></span>
<span id="cb8-14"><a href="#cb8-14"></a>                Assignment[r,c] != b /\ Assignment[r,c] &gt; <span class="fl">0</span></span>
<span id="cb8-15"><a href="#cb8-15"></a>              endif</span>
<span id="cb8-16"><a href="#cb8-16"></a>            else</span>
<span id="cb8-17"><a href="#cb8-17"></a>              <span class="co">% 2 - Tiles not present are not assigned</span></span>
<span id="cb8-18"><a href="#cb8-18"></a>              Assignment[r,c] = <span class="fl">0</span></span>
<span id="cb8-19"><a href="#cb8-19"></a>            endif)));</span></code></pre></div>
<h3 id="objective-function">Objective function</h3>
<p>We want to make the resulting rectangles as equal as possible. In order to do so, we have to compute the cardinality of each box.</p>
<div class="sourceCode" id="cb9"><pre class="sourceCode minizinc"><code class="sourceCode minizinc"><span id="cb9-1"><a href="#cb9-1"></a><span class="kw">array</span>[<span class="fl">1</span>..nboxes] <span class="kw">of</span> <span class="kw">var</span> <span class="kw">int</span>: BoxCardinality =</span>
<span id="cb9-2"><a href="#cb9-2"></a>  [ sum(r <span class="kw">in</span> minrow..maxrow, c <span class="kw">in</span> mincol..maxcol)(Grid[r,c] &gt; <span class="fl">0</span> /\ Assignment[r,c] == b) | b <span class="kw">in</span> <span class="fl">1</span>..nboxes];</span></code></pre></div>
<p>This can be done by minimizing the variance which, for a fixed total, is the same as minimizing the squared L2 norm (the dot product of the vector with itself).</p>
<div class="sourceCode" id="cb10"><pre class="sourceCode minizinc"><code class="sourceCode minizinc"><span id="cb10-1"><a href="#cb10-1"></a><span class="kw">var</span> <span class="kw">int</span>: variance = sum(b <span class="kw">in</span> <span class="fl">1</span>..nboxes)(BoxCardinality[b]*BoxCardinality[b]);</span>
<span id="cb10-2"><a href="#cb10-2"></a>solve minimize variance;</span></code></pre></div>
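<p>As a sanity check, here is a small Python sketch (not part of the model) illustrating why the two objectives agree when the total number of tiles is fixed:</p>

```python
# For a fixed total number of tiles, minimizing the sum of squared box
# cardinalities is equivalent to minimizing their variance, since
# Var(x) = E[x^2] - E[x]^2 and the mean E[x] = ntiles / nboxes is constant.
def sum_sq(cards):
    return sum(c * c for c in cards)

def variance(cards):
    mean = sum(cards) / len(cards)
    return sum((c - mean) ** 2 for c in cards) / len(cards)

balanced = [25, 25, 25, 25]  # total = 100
skewed = [40, 30, 20, 10]    # same total, less even split
```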
<h3 id="showing-the-results">Showing the results</h3>
<p>It is useful to dump the result in a format that can be easily parsed by standard command-line tools, since the output often has to be further processed. In this case, the lines corresponding to the assignment vector are prefixed with the tag <code>Tile</code> to make them easy to redirect to another file.</p>
<p>The printing itself can be done with a combination of helper functions and array comprehensions.</p>
<div class="sourceCode" id="cb11"><pre class="sourceCode minizinc"><code class="sourceCode minizinc"><span id="cb11-1"><a href="#cb11-1"></a>function string: show_assignment(<span class="kw">int</span>: r, <span class="kw">int</span>: c) = &quot;Tile: &quot; ++ show(c) ++ &quot;-&quot; ++ show(r) ++ &quot;,&quot; ++ show(Assignment[r,c]) ++ &quot;\n&quot;;</span>
<span id="cb11-2"><a href="#cb11-2"></a></span>
<span id="cb11-3"><a href="#cb11-3"></a>output </span>
<span id="cb11-4"><a href="#cb11-4"></a>  [ show_assignment(r,c) | r <span class="kw">in</span> minrow..maxrow, c <span class="kw">in</span> mincol..maxcol where Grid[r,c] &gt; <span class="fl">0</span> ] ++ </span>
<span id="cb11-5"><a href="#cb11-5"></a>  [ &quot;Variance: &quot;, show(variance), &quot;\n&quot;,</span>
<span id="cb11-6"><a href="#cb11-6"></a>    &quot;Box Cardinalities: &quot;,  show(BoxCardinality) , &quot;\n&quot; ];</span></code></pre></div>
<p>For powershell users, this could be captured, for example:</p>
<div class="sourceCode" id="cb12"><pre class="sourceCode powershell"><code class="sourceCode powershell"><span id="cb12-1"><a href="#cb12-1"></a><span class="va">$ENV</span>:FLATZINC_CMD = <span class="st">&quot;fzn-gecode&quot;</span></span>
<span id="cb12-2"><a href="#cb12-2"></a><span class="va">$Env</span>:PATH += <span class="st">&quot;;D:\Soft\MiniZinc\&quot;</span></span>
<span id="cb12-3"><a href="#cb12-3"></a>minizinc.<span class="fu">exe</span> -I D:\Soft\MiniZinc\share\minizinc\gecode\ .\tall5m.<span class="fu">mzn</span> .\tall5m.<span class="fu">dzn</span> | ? { <span class="va">$_</span> -match <span class="st">&quot;Tile: &quot;</span> } | % { <span class="va">$_</span> -replace <span class="st">&quot;Tile: &quot;</span> } | <span class="fu">out-file</span> -encoding ascii assign5.<span class="fu">csv</span></span></code></pre></div>
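<p>On any platform, the same capture can be done with a few lines of Python. The parser below is a hypothetical helper matching the <code>Tile: &lt;col&gt;-&lt;row&gt;,&lt;box&gt;</code> shape produced by <code>show_assignment</code>:</p>

```python
import re

# Lines emitted by the model's output statement look like "Tile: 3-2,1":
# column 3, row 2, assigned to box 1. Everything else is ignored.
TILE_RE = re.compile(r"^Tile: (\d+)-(\d+),(\d+)$")

def parse_assignments(lines):
    out = {}
    for line in lines:
        m = TILE_RE.match(line.strip())
        if m:
            col, row, box = map(int, m.groups())
            out[(row, col)] = box
    return out
```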
<h3 id="not-so-fast">Not so fast!</h3>
<p>For big grids, the process is too slow (on my hardware, ymmv). A practical way to mitigate this is to include further “artificial” constraints that capture some common-sense knowledge. Here we can require that box cardinalities lie in a neighborhood around a <em>perfect</em> value, which would be attained when every box has <code>ntiles / nboxes</code> tiles.</p>
<p>We can define a parameter <code>slack</code> representing the radius of that neighborhood, and add the following constraint:</p>
<div class="sourceCode" id="cb13"><pre class="sourceCode minizinc"><code class="sourceCode minizinc"><span id="cb13-1"><a href="#cb13-1"></a><span class="co">% Box cardinalities stay within a slack radius of the perfect fill factor</span></span>
<span id="cb13-2"><a href="#cb13-2"></a><span class="kw">float</span>: fill_factor = (ntiles / nboxes);</span>
<span id="cb13-3"><a href="#cb13-3"></a></span>
<span id="cb13-4"><a href="#cb13-4"></a><span class="kw">constraint</span></span>
<span id="cb13-5"><a href="#cb13-5"></a>   forall(b <span class="kw">in</span> <span class="fl">1</span>..nboxes) ( (<span class="fl">1.0</span> - slack)*fill_factor &lt;= BoxCardinality[b] /\ BoxCardinality[b] &lt;= (<span class="fl">1.0</span> + slack)*fill_factor ) ; </span></code></pre></div>
<p>This is a common pattern in discrete optimization problems, where a hybrid system can be developed: in this case, we could use some outer search to optimize the value of the slack across different invocations of MiniZinc.</p>
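<p>That outer search over the slack can be sketched in Python. The helper names here are hypothetical, and the feasibility check is a stand-in for what would really be a full MiniZinc invocation per candidate slack:</p>

```python
# Bounds induced by a given slack: each box cardinality must lie in
# [(1 - slack) * fill, (1 + slack) * fill], with fill = ntiles / nboxes.
def bounds(ntiles, nboxes, slack):
    fill = ntiles / nboxes
    return (1.0 - slack) * fill, (1.0 + slack) * fill

def within_slack(cards, ntiles, nboxes, slack):
    lo, hi = bounds(ntiles, nboxes, slack)
    return all(lo <= c <= hi for c in cards)

# A real hybrid system would call MiniZinc with growing slack values until
# the model becomes satisfiable; this simulates that loop on known cards.
def smallest_feasible_slack(cards, ntiles, nboxes, step=0.05):
    slack = 0.0
    while not within_slack(cards, ntiles, nboxes, slack):
        slack += step
    return slack
```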
<h2 id="results">Results</h2>
<p>By processing the output of MiniZinc and joining it into a <a href="https://www.qgis.org">QGIS</a> project, we can easily map the box assignment. Here is the result for 4 boxes:</p>
<p><img src="/images/tiles5k_colored.png" title="Orthophoto Tiling" class="center" /></p>
<p>For 8 boxes (8 parallel processors), the result would be:</p>
<p><img src="/images/tiles5k_8box_colored.png" title="Orthophoto Tiling" class="center" /></p>
<h2 id="conclusions">Conclusions</h2>
<p>Even though I know the basic theory behind mixed integer and floating-point solvers (I even implemented a simplex-based solver as a practical exercise in the past), I keep having the feeling there is some form of magic at work here.</p>
<p>There are lots of other ways to model this problem. In particular, MiniZinc has special primitives for dealing with sets. Some of the restrictions explicitly stated in the model are already available for reuse in the <code>globals</code> library, which would probably be more efficient and would lead to terser code. I would like to rewrite the model using these functions and compare their efficiency if I ever have the time.</p>
<p>For now, I got my results!</p>

<div class="panel panel-default">
    <div class="panel-body">
        <div class="pull-left">
            Tags: <a href="/tags/geospatial.html">geospatial</a>, <a href="/tags/minizinc.html">minizinc</a>, <a href="/tags/optimization.html">optimization</a>
        </div>
        <div class="social pull-right">
            <span class="twitter">
                <a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2020/02/29/optimizing-geospatial-workloads.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
            </span>

            
        </div>
    </div>
</div>

<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>

<div id="disqus_thread"></div>  
<script type="text/javascript">
      var disqus_shortname = 'jarnaldich';
      (function() {
          var dsq = document.createElement('script');
          dsq.type = 'text/javascript';
          dsq.async = true;
          dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
          (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
      })();
</script>
<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
    
]]></description>
    <pubDate>Sat, 29 Feb 2020 00:00:00 UT</pubDate>
    <guid>http://jarnaldich.me/blog/2020/02/29/optimizing-geospatial-workloads.html</guid>
    <dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
    <title>Foreign Data Wrappers for Data Synchronization</title>
    <link>http://jarnaldich.me/blog/2018/10/02/fdw-sync.html</link>
    <description><![CDATA[<h1>Foreign Data Wrappers for Data Synchronization</h1>

<small>Posted on October  2, 2018 <a href="/blog/2018/10/02/fdw-sync.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>

<p>The <a href="http://www.postgresqltutorial.com/postgresql-copy-database/">standard way</a> of copying databases (or just tables) between PostgreSQL servers seems to be through backup (with its many options). This article describes another way that I have been using lately, based on foreign data wrappers. The two servers need to be able to connect to each other (albeit only during synchronization), but the method needs no shell access to either of them and avoids generating intermediate files. The steps involved are:</p>
<ol type="1">
<li>Install the foreign data wrapper extension for your database (just once in the target server).</li>
<li>Set up the foreign server connection in the target server, pointing to the source server.</li>
<li>Set up a user mapping.</li>
<li>Import the foreign tables (or schema) into a “proxy” schema.</li>
<li>Create materialized views for the desired tables.</li>
</ol>
<h2 id="install-the-foreign-data-wrapper">Install the Foreign Data Wrapper</h2>
<p><a href="https://wiki.postgresql.org/wiki/Foreign_data_wrappers">Foreign Data Wrappers</a> are a mechanism that allow presenting external data sources as PostgreSQL tables. Note that it is not limited to foreign PostgreSQL databases: there are foreign data wrappers for other DB servers and other sources of data, including CSV files and even Twitter streams. Once the data is presented into Postgres, all the power of SQL becomes available for your data, so they are quite a feature for data management and integration.</p>
<p>FDWs are installed as extensions, one for each kind of data source. Of course, the standard distribution includes one for connecting to external PostgreSQL databases:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb1-1"><a href="#cb1-1"></a><span class="kw">CREATE</span> EXTENSION postgres_fdw;</span></code></pre></div>
<h2 id="creating-the-server">Creating the server</h2>
<p>Once the extension is installed, the remote server needs to be set up:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb2-1"><a href="#cb2-1"></a><span class="kw">CREATE</span> SERVER remote_server</span>
<span id="cb2-2"><a href="#cb2-2"></a>  <span class="kw">FOREIGN</span> <span class="kw">DATA</span> WRAPPER postgres_fdw</span>
<span id="cb2-3"><a href="#cb2-3"></a>  OPTIONS (host <span class="st">&#39;host_or_ip&#39;</span>, dbname <span class="st">&#39;db_name&#39;</span>);</span></code></pre></div>
<h2 id="creating-the-user-mapping">Creating the user mapping</h2>
<p>In order to allow for greater flexibility in terms of permissions, the remote server needs a user mapping, which will map users between the source and target servers. For every mapped user, the following sentence should be used:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb3-1"><a href="#cb3-1"></a><span class="kw">CREATE</span> <span class="fu">USER</span> MAPPING <span class="cf">FOR</span> postgres SERVER remote_server</span>
<span id="cb3-2"><a href="#cb3-2"></a>    OPTIONS (<span class="kw">password</span> <span class="st">&#39;pwd&#39;</span>, <span class="ot">&quot;user&quot;</span> <span class="st">&#39;postgres&#39;</span>);</span></code></pre></div>
<h2 id="create-the-foreign-tables">Create the foreign tables</h2>
<p>Once the database and server are linked, we can start creating the foreign tables. Notice that foreign tables are just “proxies” for the external tables (think of them as symlinks in the file system or pointers in a programming language). That means creating them is just a matter of defining their structure, no data is transferred, and hence should be fast. The downside is that the description for the foreign tables has to be written in the target server (much like writing the table create script).</p>
<p>In order to make the process easier, PostgreSQL has a command that will import the whole foreign structure in one go:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb4-1"><a href="#cb4-1"></a>IMPORT <span class="kw">FOREIGN</span> <span class="kw">SCHEMA</span> source_schema <span class="kw">FROM</span> SERVER source_server <span class="kw">INTO</span> proxy_schema;</span></code></pre></div>
<p>If you just want to import some tables of the schema, you can use:</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb5-1"><a href="#cb5-1"></a>IMPORT <span class="kw">FOREIGN</span> <span class="kw">SCHEMA</span> <span class="kw">public</span> <span class="kw">LIMIT</span> <span class="kw">TO</span> </span>
<span id="cb5-2"><a href="#cb5-2"></a>( table1, table2 )</span>
<span id="cb5-3"><a href="#cb5-3"></a><span class="kw">FROM</span> SERVER source_server <span class="kw">INTO</span> proxy_schema;</span></code></pre></div>
<p>Just refer to the PostgreSQL documentation on <code>IMPORT FOREIGN SCHEMA</code> for other options.</p>
<p>You can verify that the tables have been imported typing <code>\det</code> inside the <code>psql</code> cli.</p>
<h2 id="instantiate-materialized-views">Instantiate materialized views</h2>
<p>As stated before, foreign tables are just a proxy for the real data. In order to be able to work independently of the source server, actual data needs to be copied. The easiest way to do so in order to be able to update the data is through materialized views. You can think of them as new tables with a refresh mechanism. In particular, that means that the original indices over the data will be lost, so new indices should be created.</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb6-1"><a href="#cb6-1"></a><span class="kw">CREATE</span> <span class="kw">MATERIALIZED</span> <span class="kw">VIEW</span> view_name <span class="kw">AS</span> <span class="kw">SELECT</span> <span class="op">*</span> <span class="kw">FROM</span> proxy_schema.table_name;</span></code></pre></div>
<p>Whenever the data needs to be refreshed, just:</p>
<div class="sourceCode" id="cb7"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb7-1"><a href="#cb7-1"></a><span class="kw">REFRESH</span> <span class="kw">MATERIALIZED</span> <span class="kw">VIEW</span> view_name;</span></code></pre></div>
<h2 id="in-the-command-line-client">In the command-line client</h2>
<p>The following commands might be useful if you use the <code>psql</code> client:</p>
<ul>
<li><code>\det &lt;pattern&gt;</code> lists foreign tables</li>
<li><code>\des &lt;pattern&gt;</code> lists foreign servers</li>
<li><code>\deu &lt;pattern&gt;</code> lists user mappings</li>
<li><code>\dew &lt;pattern&gt;</code> lists foreign-data wrappers</li>
<li><code>\dm &lt;pattern&gt;</code> lists materialized views</li>
</ul>
<h2 id="helper-function">Helper function</h2>
<p>Depending on how many tables you wish to import, something along the following anonymous code block might be useful. It just creates the materialized views and indices, but it can be adapted to whatever is needed.</p>
<div class="sourceCode" id="cb8"><pre class="sourceCode sql"><code class="sourceCode sql"><span id="cb8-1"><a href="#cb8-1"></a>DO $$</span>
<span id="cb8-2"><a href="#cb8-2"></a><span class="kw">DECLARE</span> r <span class="dt">record</span>;</span>
<span id="cb8-3"><a href="#cb8-3"></a><span class="cf">BEGIN</span></span>
<span id="cb8-4"><a href="#cb8-4"></a>    <span class="cf">FOR</span> r <span class="kw">IN</span> <span class="kw">SELECT</span> tname <span class="kw">FROM</span> (<span class="kw">VALUES</span> </span>
<span id="cb8-5"><a href="#cb8-5"></a>            (<span class="st">&#39;table1&#39;</span>),</span>
<span id="cb8-6"><a href="#cb8-6"></a>            (<span class="st">&#39;table2&#39;</span>), </span>
<span id="cb8-7"><a href="#cb8-7"></a>            (<span class="st">&#39;...&#39;</span>), </span>
<span id="cb8-8"><a href="#cb8-8"></a>            (<span class="st">&#39;tableN&#39;</span>)) <span class="kw">AS</span> x(tname)</span>
<span id="cb8-9"><a href="#cb8-9"></a>    <span class="cf">LOOP</span></span>
<span id="cb8-10"><a href="#cb8-10"></a>        <span class="co">-- SQL automatically concatenates strings if there is a line separator in between</span></span>
<span id="cb8-11"><a href="#cb8-11"></a>        <span class="kw">EXECUTE</span> format(<span class="st">&#39;CREATE MATERIALIZED VIEW IF NOT EXISTS %s AS &#39;</span></span>
<span id="cb8-12"><a href="#cb8-12"></a>            <span class="st">&#39;SELECT * FROM proxy_schema.%s&#39;</span>,</span>
<span id="cb8-13"><a href="#cb8-13"></a>             r.tname, r.tname);</span>
<span id="cb8-14"><a href="#cb8-14"></a>        <span class="co">-- Index by geometry (Postgis), just an example</span></span>
<span id="cb8-15"><a href="#cb8-15"></a>        <span class="kw">EXECUTE</span> format(<span class="st">&#39;CREATE INDEX IF NOT EXISTS sidx_%s ON %s USING GIST (geom)&#39;</span>,</span>
<span id="cb8-16"><a href="#cb8-16"></a>             r.tname, r.tname);</span>
<span id="cb8-17"><a href="#cb8-17"></a>    <span class="cf">END</span> <span class="cf">LOOP</span>;</span>
<span id="cb8-18"><a href="#cb8-18"></a>END$$;</span></code></pre></div>
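<p>If you prefer to drive the setup from a script instead of a DO block, a minimal Python sketch that generates the same pair of statements per table (hypothetical helper; table names are assumed to be trusted identifiers, and <code>proxy_schema</code> is the schema used above):</p>

```python
# Generate the CREATE MATERIALIZED VIEW / CREATE INDEX statements that the
# anonymous DO block above executes, one pair per table.
def mat_view_statements(tables, schema="proxy_schema"):
    stmts = []
    for t in tables:
        stmts.append(
            f"CREATE MATERIALIZED VIEW IF NOT EXISTS {t} AS "
            f"SELECT * FROM {schema}.{t};"
        )
        # GiST index on the PostGIS geometry column, as in the example
        stmts.append(
            f"CREATE INDEX IF NOT EXISTS sidx_{t} ON {t} USING GIST (geom);"
        )
    return stmts
```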
<h2 id="conclusion-and-final-warnings">Conclusion and final warnings</h2>
<p>This method can be convenient if data has to be synced frequently, as syncing just boils down to refreshing the materialized view. This can also be done from within other PL/pgSQL functions, and needs no external tools (apart from a client) or intermediate files.</p>
<p>It is not for every situation, though. In particular, <em>the data in materialized views is not backed up</em> (<code>pg_dump</code> generates the view create script and performs a <code>REFRESH</code>). That means that if the original server is unavailable at restore time, data will be lost. This can be avoided by using regular <em>proxy</em> tables instead of materialized views.</p>

<div class="panel panel-default">
    <div class="panel-body">
        <div class="pull-left">
            Tags: <a href="/tags/postgres.html">postgres</a>, <a href="/tags/fdw.html">fdw</a>, <a href="/tags/backup.html">backup</a>
        </div>
        <div class="social pull-right">
            <span class="twitter">
                <a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2018/10/02/fdw-sync.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
            </span>

            
        </div>
    </div>
</div>

<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>

<div id="disqus_thread"></div>  
<script type="text/javascript">
      var disqus_shortname = 'jarnaldich';
      (function() {
          var dsq = document.createElement('script');
          dsq.type = 'text/javascript';
          dsq.async = true;
          dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
          (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
      })();
</script>
<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
    
]]></description>
    <pubDate>Tue, 02 Oct 2018 00:00:00 UT</pubDate>
    <guid>http://jarnaldich.me/blog/2018/10/02/fdw-sync.html</guid>
    <dc:creator>Joan Arnaldich</dc:creator>
</item>
<item>
    <title>Porting the Unix Philosophy to Windows</title>
    <link>http://jarnaldich.me/blog/2017/06/06/powershell-unix-philo.html</link>
    <description><![CDATA[<h1>Porting the Unix Philosophy to Windows</h1>

<small>Posted on June  6, 2017 <a href="/blog/2017/06/06/powershell-unix-philo.html"><i class="fa fa-link fa-lg fa-fw"></i></a></small>

<p>Since 2016, it is safe to say that Windows has a <a href="https://en.wikipedia.org/wiki/PowerShell">pretty decent shell</a>. Actually, it has had one for some time now, but in August 2016 it went open-source and cross-platform. Although it is still common to find old-style <code>.bat</code> files lingering around in many organizations, it seems clear that PowerShell is getting out of its initial sysadmin niche to become the new de-facto standard shell for Windows (hey, even for <a href="https://www.symantec.com/content/dam/symantec/docs/security-center/white-papers/increased-use-of-powershell-in-attacks-16-en.pdf">malware…</a>). And no, I do not even think WSH deserves a mention.</p>
<p>Arguably, the most profound change PowerShell brings is not a more powerful (ehem!) shell on Windows, but enabling the kind of scripting Unix has excelled at for decades.</p>
<h2 id="the-unix-philosophy">The Unix Philosophy</h2>
<p>The <a href="https://en.wikipedia.org/wiki/Unix_philosophy">Unix Philosophy</a> is often epitomized in one sentence:</p>
<blockquote>
<p>Do One Thing and Do It Well</p>
</blockquote>
<p>The implications of that for shell scripting are translated into the fact that Unix has lots of small executables devoted to one task, and it is the shell’s responsibility to enable composing these bits of functionality into more complex ones, the most prominent tool for that being the pipe <code>|</code> operator, which feeds the output of a program into the input of the next.</p>
<p>PowerShell has a more or less generalized version of this, where the pieces are called <code>CmdLets</code>, and what is sent down the pipe is a stream of <em>CLR objects</em>, not just a stream of bytes. Before PowerShell, it was impossible to do this on Windows to the extent that it was on Unix.</p>
<p>The idea is simple, but the skill is difficult to master. Shell scripting <em>is</em> programming, but the abstractions provided by the shell are different from the ones you would find in a fully-fledged programming language.</p>
<p>In this blog post I would like to present a particular example of what this change means.</p>
<h1 id="the-task">The Task</h1>
<p>A quick and dirty way to monitor the progress of a batch process is probing the number of output files in a directory at regular intervals and maybe save that to a file eg. for plotting, statistics, etc…</p>
<p>It is quite probable that a developer would come up with a solution very much like the function below:</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode powershell"><code class="sourceCode powershell"><span id="cb1-1"><a href="#cb1-1"></a><span class="kw">function</span> Sample-Count-Files {</span>
<span id="cb1-2"><a href="#cb1-2"></a>    [CmdletBinding()]</span>
<span id="cb1-3"><a href="#cb1-3"></a></span>
<span id="cb1-4"><a href="#cb1-4"></a>    <span class="kw">param</span>(</span>
<span id="cb1-5"><a href="#cb1-5"></a>        <span class="co"># The file pattern to count</span></span>
<span id="cb1-6"><a href="#cb1-6"></a>        [Parameter(Mandatory=<span class="va">$true</span>, </span>
<span id="cb1-7"><a href="#cb1-7"></a>                   Position=0)]</span>
<span id="cb1-8"><a href="#cb1-8"></a>        [<span class="dt">string</span>]<span class="va">$pattern</span>,</span>
<span id="cb1-9"><a href="#cb1-9"></a></span>
<span id="cb1-10"><a href="#cb1-10"></a>        <span class="co"># Name of the log file</span></span>
<span id="cb1-11"><a href="#cb1-11"></a>        [Parameter(Mandatory=<span class="va">$true</span>, </span>
<span id="cb1-12"><a href="#cb1-12"></a>            Position=1)]</span>
<span id="cb1-13"><a href="#cb1-13"></a>        [<span class="dt">string</span>]<span class="va">$logfile</span>,</span>
<span id="cb1-14"><a href="#cb1-14"></a></span>
<span id="cb1-15"><a href="#cb1-15"></a>        <span class="co"># Seconds interval between samples</span></span>
<span id="cb1-16"><a href="#cb1-16"></a>        [Parameter(Mandatory=<span class="va">$true</span>, </span>
<span id="cb1-17"><a href="#cb1-17"></a>            Position=2)]</span>
<span id="cb1-18"><a href="#cb1-18"></a>        [<span class="dt">int</span>]<span class="va">$seconds</span>)</span>
<span id="cb1-19"><a href="#cb1-19"></a></span>
<span id="cb1-20"><a href="#cb1-20"></a>    <span class="kw">While</span>(<span class="va">$true</span>) {</span>
<span id="cb1-21"><a href="#cb1-21"></a>        <span class="fu">sleep</span> <span class="va">$seconds</span>;</span>
<span id="cb1-22"><a href="#cb1-22"></a>        <span class="va">$cnt</span> = (<span class="fu">dir</span> <span class="va">$pattern</span>).<span class="fu">Count</span>;</span>
<span id="cb1-23"><a href="#cb1-23"></a>        <span class="va">$d</span> = <span class="fu">Get-Date</span>;</span>
<span id="cb1-24"><a href="#cb1-24"></a>        <span class="va">$d</span>.<span class="fu">ToString</span>(<span class="st">&quot;yyyy-MM-dd HH:mm:ss&quot;</span>) + <span class="st">&quot;`t$cnt&quot;</span> | <span class="fu">Out-File</span> <span class="va">$logfile</span> -encoding ascii -Append -ob 0</span>
<span id="cb1-25"><a href="#cb1-25"></a>    }</span>
<span id="cb1-26"><a href="#cb1-26"></a>}</span></code></pre></div>
<p>That is, an infinite loop that waits for a number of seconds before counting the output files and emitting a line with the date and the result. Executing the above code will block the console, so to monitor progress you can always wrap it into a <code>PSJob</code> or open a new console and just tail the file:</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode powershell"><code class="sourceCode powershell"><span id="cb2-1"><a href="#cb2-1"></a><span class="fu">gc</span> -tail 10 -Wait samples.<span class="fu">tsv</span></span></code></pre></div>
<p>I suspect most developers who have not dealt with Unix scripting would come up with something along these lines. My bet, however, is that any experienced Unix scripter would frown upon it, feeling the script is trying to do <em>too much</em>. In particular, it is in charge of:</p>
<ul>
<li><strong>The When</strong>: The counting is done every <em>n</em> seconds.</li>
<li><strong>The What</strong>: The counting itself.</li>
<li><strong>The Output</strong>: We are saving into a text file and redirecting to screen.</li>
</ul>
<p>These three pieces of functionality are coupled in our function, and they need not be. For example, separating the <em>when</em> from the <em>what</em> would allow us to fire <em>any action</em> every <em>n</em> seconds.</p>
<p>This is indeed not difficult in PowerShell, see:</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode powershell"><code class="sourceCode powershell"><span id="cb3-1"><a href="#cb3-1"></a><span class="kw">function</span> Tick {</span>
<span id="cb3-2"><a href="#cb3-2"></a>    [CmdletBinding()]</span>
<span id="cb3-3"><a href="#cb3-3"></a>    <span class="kw">Param</span>([Parameter(Mandatory=<span class="va">$true</span>, Position=0)][<span class="dt">int</span>]<span class="va">$Seconds</span>)</span>
<span id="cb3-4"><a href="#cb3-4"></a>    <span class="kw">Process</span> {</span>
<span id="cb3-5"><a href="#cb3-5"></a>        <span class="kw">while</span>(<span class="va">$true</span>) { </span>
<span id="cb3-6"><a href="#cb3-6"></a>            <span class="fu">Start-Sleep</span> -Seconds <span class="va">$Seconds</span></span>
<span id="cb3-7"><a href="#cb3-7"></a>            <span class="fu">Get-Date</span></span>
<span id="cb3-8"><a href="#cb3-8"></a>        }    </span>
<span id="cb3-9"><a href="#cb3-9"></a>    }</span>
<span id="cb3-10"><a href="#cb3-10"></a>}</span></code></pre></div>
<p>This <code>CmdLet</code> waits for a number of seconds before sending a <code>Date</code> object downstream to do whatever we please with it, and loops (mind it is inside the <code>Process</code> section). It is like a pulse generating objects at regular intervals, so now we can reuse the <em>when</em> in different contexts:</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode powershell"><code class="sourceCode powershell"><span id="cb4-1"><a href="#cb4-1"></a>Tick 10 | % { <span class="va">$_</span> }</span></code></pre></div>
<p>We can write a similar function for the <em>what</em>. In a real scenario we probably would just write a one-liner, because our code is so simple, but in this post we will go the full way. As a bonus, instead of working with strings, we demonstrate how to pass custom objects down the pipeline. Here, we create an object with two members: the time and the file count.</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode powershell"><code class="sourceCode powershell"><span id="cb5-1"><a href="#cb5-1"></a></span>
<span id="cb5-2"><a href="#cb5-2"></a><span class="kw">function</span> Count {</span>
<span id="cb5-3"><a href="#cb5-3"></a>    [CmdletBinding()]</span>
<span id="cb5-4"><a href="#cb5-4"></a>    <span class="kw">Param</span>(</span>
<span id="cb5-5"><a href="#cb5-5"></a>    	[Parameter(Mandatory=<span class="va">$true</span>, Position=0)]</span>
<span id="cb5-6"><a href="#cb5-6"></a>	    [<span class="dt">string</span>]</span>
<span id="cb5-7"><a href="#cb5-7"></a>	    <span class="va">$Pattern</span>,</span>
<span id="cb5-8"><a href="#cb5-8"></a></span>
<span id="cb5-9"><a href="#cb5-9"></a>        [Parameter(Mandatory=<span class="va">$true</span>, Position=1, ValueFromPipeline=<span class="va">$true</span>)]</span>
<span id="cb5-10"><a href="#cb5-10"></a>	    <span class="va">$Time</span></span>
<span id="cb5-11"><a href="#cb5-11"></a>    )</span>
<span id="cb5-12"><a href="#cb5-12"></a></span>
<span id="cb5-13"><a href="#cb5-13"></a>    <span class="kw">Process</span> {</span>
<span id="cb5-14"><a href="#cb5-14"></a>        [PSCustomObject]@{ </span>
<span id="cb5-15"><a href="#cb5-15"></a>           Time=<span class="va">$Time</span>; </span>
<span id="cb5-16"><a href="#cb5-16"></a>           Files= $(<span class="fu">dir</span> <span class="va">$Pattern</span>).<span class="fu">Count</span> </span>
<span id="cb5-17"><a href="#cb5-17"></a>        }</span>
<span id="cb5-18"><a href="#cb5-18"></a>    }</span>
<span id="cb5-19"><a href="#cb5-19"></a>}</span></code></pre></div>
<p>For the output, we could write our own function again, but it turns out there already is a Powershell function that formats objects into CSV files, conveniently named <code>Export-Csv</code>. Putting it all together:</p>
<div class="sourceCode" id="cb6"><pre class="sourceCode powershell"><code class="sourceCode powershell"><span id="cb6-1"><a href="#cb6-1"></a>Tick 1 -ob 0 | Count *.<span class="fu">out</span> | <span class="fu">Export-Csv</span> times.<span class="fu">csv</span></span></code></pre></div>
<p>This is clean, easy to understand and easy to reuse, just like good <em>Unix</em> scripting is supposed to be. By the way, if you are wondering where the <code>-ob 0</code> (short for <code>-OutBuffer 0</code>) came from, since we did not explicitly add it to our script, it is a <a href="https://msdn.microsoft.com/en-us/powershell/reference/5.1/microsoft.powershell.core/about/about_commonparameters">common parameter</a>. For efficiency reasons, PowerShell can wait until a number of objects have accumulated in a buffer before sending them downstream. That is obviously not what we want here, so we set the buffer size to 0.</p>
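<p>Once the measurements are on disk, <code>Import-Csv</code> is the symmetric cmdlet for reading them back as objects. A quick sketch, assuming the <code>times.csv</code> produced above:</p>
<div class="sourceCode"><pre class="sourceCode powershell"><code># Re-hydrate the rows and summarize the Files column
Import-Csv times.csv | Measure-Object -Property Files -Average -Maximum</code></pre></div>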
<h2 id="conclusion">Conclusion</h2>
<p>It is often a good idea, when approaching shell scripting, to take a step back and ask whether we are trying to accomplish too much at once, and which pieces of functionality we would like to reuse in the future. That is not really different from the software architecture best practices applied when programming in the large, but the abstractions (programming with pipes and streams) are. Windows programmers can now apply the same principles that have been at play in Unix for decades.</p>
<h2 id="see-also">See also</h2>
<ul>
<li>The sampling approach is probably too naive. For more robust approaches one should probably look into the <code>Timer</code> and <code>FileSystemWatcher</code> events.</li>
<li>This <a href="https://gist.github.com/jarnaldich/67296c892cde9f9c5bbe9d7ccac97ee9">Gist</a> has the code for this article.</li>
</ul>
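<p>For the record, here is a rough sketch of the event-driven alternative mentioned above, using the .NET <code>FileSystemWatcher</code> class (the path, filter and output file are illustrative):</p>
<div class="sourceCode"><pre class="sourceCode powershell"><code># Append a timestamped line to times.csv whenever a new .out file appears
$watcher = New-Object System.IO.FileSystemWatcher -Property @{
    Path = (Get-Location).Path
    Filter = '*.out'
    EnableRaisingEvents = $true
}
Register-ObjectEvent $watcher Created -Action {
    "$((Get-Date).ToString('s')),$($Event.SourceEventArgs.Name)" | Add-Content times.csv
}</code></pre></div>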

<div class="panel panel-default">
    <div class="panel-body">
        <div class="pull-left">
            Tags: <a href="/tags/powershell.html">powershell</a>, <a href="/tags/unix.html">unix</a>, <a href="/tags/windows.html">windows</a>
        </div>
        <div class="social pull-right">
            <span class="twitter">
                <a href="https://twitter.com/share" class="twitter-share-button" data-url="https://jarnaldich.me/blog/2017/06/06/powershell-unix-philo.html" data-via="jarnaldich.me" data-dnt="true">Tweet</a>
            </span>

            
        </div>
    </div>
</div>

<script>!function(d,s,id){var js,fjs=d.getElementsByTagName(s)[0],p=/^http:/.test(d.location)?'http':'https';if(!d.getElementById(id)){js=d.createElement(s);js.id=id;js.src=p+'://platform.twitter.com/widgets.js';fjs.parentNode.insertBefore(js,fjs);}}(document, 'script', 'twitter-wjs');</script>

<div id="disqus_thread"></div>  
<script type="text/javascript">
      var disqus_shortname = 'jarnaldich';
      (function() {
          var dsq = document.createElement('script');
          dsq.type = 'text/javascript';
          dsq.async = true;
          dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
          (document.getElementsByTagName('head')[0] || document.getElementsByTagName('body')[0]).appendChild(dsq);
      })();
</script>
<noscript>Please enable JavaScript to view the <a href="https://disqus.com/?ref_noscript">comments powered by Disqus.</a></noscript>
    
]]></description>
    <pubDate>Tue, 06 Jun 2017 00:00:00 UT</pubDate>
    <guid>http://jarnaldich.me/blog/2017/06/06/powershell-unix-philo.html</guid>
    <dc:creator>Joan Arnaldich</dc:creator>
</item>

    </channel>
</rss>
