Friday, March 28, 2008

Cache, fast as a speeding bullet

When it comes to viewing and analyzing data, the main difference between a typical non-spatial, data-driven application and a GIS application is volume. The strategies for dealing with large datasets are well known for non-spatial apps. For example, you might page through data ten rows at a time (results from a Google search). Or you might use summarizing functions - like SUM, MAX, and UNIQUE - against the data before viewing it (pie charts). Or you might load a small sliver of each record, then later, load an entire record only when an interesting one is located (first a list of just airport codes, then the airport's full schedule when an airport is picked).

Of course, the very nature of non-spatial data apps solves most of these problems for us. It simply makes no sense to return the entire result set of, say, a Google search. Until all humans come standard with cybernetic coprocessors, we simply can't consume much more tabular data than about ten rows at a time.

Currently, spatial data processing has no real equivalent to these techniques. For instance, every time a zoom-all is performed, the entire dataset is processed and drawn to the display.

Now humor me for a moment and pretend you are Superman. You’ve just been called to duty - a Floridian octogenarian’s feline is stuck in a ficus. You, being Superman, must high-cape it from Metropolis to Ft. Lauderdale, pronto. Fortunately, today’s visibility is 10 miles, which will make navigation a snap. You step out into the sun after a quick phone booth visit and note that you need to update your change room to something more modern and ubiquitous; Starbucks perhaps. Or maybe iPhone 2.0 will somehow become a portable phone booth. Yeah, that’d be nice … suggest it to Steve at the next Apple board meeting … anyway, where were we? Oh yes, cats in peril and GIS metaphors. Imagine your view of the terrain as you take to flight. At first the detail is cars, people, PEOPL… *snap* *snap* Are you listening?! Eyes off Lois! Pay attention here, boy wonder! As you rise, detail reduces to buildings and city blocks. Eventually the view becomes nothing more than a patchwork of green and brown areas of farm surrounding the grey blob that is Metropolis. The scenario then plays in reverse as you descend into your destination. Detail quickly increases while you “zoom into” Ft. Lauderdale. And, yeah yeah, you save the day and all that happy stuff. But that’s neither here nor there.

The point is, the “real world” (comic book or otherwise) naturally works in our favor to reduce the complexity and amount of data humans must visually process. And it doesn’t matter whether this “lossy compression” is attributed to the eye’s limitations, or the brain’s limitations, or both. What matters is that it’s an appropriate model to apply to GIS. We need an algorithm that produces a continuous set of outputs that visually summarize the spatial data for human consumption. An “antialiasing” of map data, if you will. It’s true; we have seen some primitive attempts at this involving scale-dependent filtering, where more and more detail is added at discrete increments. For instance, a typical road map shows no roads at the national level, highways at the state level, secondary roads at the county level, and all roads at the neighborhood level. But this approach suffers in two major ways:

First, tweaking a map so it’s “just right” is not only tedious, mind-numbing work, but the time investment can be considerable. And setting up the min & max scale factors for each layer isn’t the half of it. Those who take pride in making beautiful, user-friendly maps know that gobs of time are spent breaking up a single, detailed layer into a set of layers. This approach requires one very detailed layer, one very general layer, and one for each increment in between. Do this for each layer that requires it and you’ll see what I’m getting at. In fact, it gets worse. Append to this process the ever-looming task of data updates. The whole scenario quickly spirals into one big GIS headache.

Second, setting arbitrary levels of detail is just that, arbitrary, and leads to a “jagged” user experience. What this means is that features of the map appear and disappear abruptly. One moment highway shield symbols are visible, you zoom out a little, and the next moment they vanish. This unpredictability disorients users, forcing them to work harder to find the information they want. Imagine, instead, zooming in and out fluidly and jaggy free. The experience would have greater utility and be more enjoyable overall, maybe even fun.
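To make the scale-dependent filtering described above concrete, here is a minimal sketch in Python. The layer names follow the road map example; the scale thresholds and the visible_layers helper are made up purely for illustration and don't come from any real map configuration.

```python
# Hypothetical layer table: (layer name, largest scale denominator at which
# the layer is still drawn). The thresholds are illustrative assumptions.
LAYERS = [
    ("highways", 5_000_000),        # visible from roughly state level inward
    ("secondary roads", 500_000),   # visible from roughly county level inward
    ("all roads", 50_000),          # visible only at neighborhood level
]


def visible_layers(scale_denominator):
    """Return the names of the layers drawn at a 1:scale_denominator view."""
    return [name for name, max_denominator in LAYERS
            if scale_denominator <= max_denominator]


# Zooming out by a hair across a threshold makes a whole layer vanish at once.
print(visible_layers(500_000))   # secondary roads still drawn
print(visible_layers(500_001))   # secondary roads gone
```

Notice there is nothing gradual about it: a layer is either entirely on or entirely off, which is exactly the "jagged" behavior at each threshold.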

The good news is research in this area is beginning to produce some tangible technology. PostGIS includes the SIMPLIFY function. But it needs some work before it can be applied to the scenario discussed here. SIMPLIFY produces undesired artifacts at various locations where polygons touch. Polygon seams should remain touching, instead of opening up into the holes that occur when the Douglas-Peucker algorithm is applied, as the SIMPLIFY function does (see this). There’s also Seadragon, which was recently announced for inclusion in Silverlight 2.0. Do yourself a favor: if you haven’t looked at the TED video, do so. You won’t be disappointed. Of course Seadragon is a raster technology, but I dream of a GIS future that has Seadragon-like vector capabilities.
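For the curious, here is a minimal sketch of the Douglas-Peucker idea in Python. It is a textbook-style version written for illustration, not the PostGIS implementation; the function names are my own, and it measures distance to the segment between the end points, which is one common variant.

```python
import math


def point_segment_distance(p, a, b):
    """Distance from point p to the segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        # Degenerate segment: a and b are the same point.
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def douglas_peucker(points, tolerance):
    """Simplify a polyline, dropping vertices that deviate <= tolerance."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    # Find the interior vertex farthest from the first-last segment.
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        d = point_segment_distance(points[i], first, last)
        if d > max_dist:
            index, max_dist = i, d
    if max_dist <= tolerance:
        # Everything in between is close enough: keep only the end points.
        return [first, last]
    # Otherwise keep the farthest vertex and simplify both halves.
    left = douglas_peucker(points[:index + 1], tolerance)
    right = douglas_peucker(points[index:], tolerance)
    return left[:-1] + right


line = [(0, 0), (1, 0.1), (2, 0), (3, 2), (4, 0)]
print(douglas_peucker(line, 0.5))
```

The algorithm only ever looks at one line at a time. When each polygon is simplified on its own like this, nothing forces two neighbors to keep the same vertices along a shared border, which is one plausible way the gaps mentioned above can appear.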

So where does zigGIS fit into this whole grand scheme? Now that all the planned 2.0 features are implemented, we’ve been focusing on tuning overall performance. This got me thinking about the various optimization techniques we could use. We did tinker with SIMPLIFY, but not only would it not work for the reasons explained above, the ArcObjects framework doesn’t support this technique either. It’s impossible to tell whether ArcObjects is requesting geometry for a zoom operation (where SIMPLIFY is useful) or for some other operation such as editing (where SIMPLIFY is inappropriate and even erroneous). Ultimately we decided to implement a caching scheme. I think the results are impressive, but you decide for yourself.

view screencast

Without caching, a recordset of 22,600 records takes ~19s to process a full zoom. With caching, the same recordset takes ~4s. Happy cats in less time!
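To give a feel for why caching helps, here is a bare-bones sketch in Python. To be clear, this is not how zigGIS implements its cache; the GeometryCache class, the fetch callback, and the feature ids are all hypothetical, and the only point is that a repeat request never goes back to the database.

```python
class GeometryCache:
    """Hypothetical read-through cache keyed by feature id."""

    def __init__(self, fetch):
        self._fetch = fetch   # slow lookup, e.g. a database round trip
        self._store = {}      # feature id -> geometry already fetched
        self.hits = 0
        self.misses = 0

    def get(self, feature_id):
        """Return the geometry, fetching it only on the first request."""
        if feature_id in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[feature_id] = self._fetch(feature_id)
        return self._store[feature_id]

    def invalidate(self, feature_id):
        """Forget a feature, e.g. after it has been edited."""
        self._store.pop(feature_id, None)


def slow_fetch(feature_id):
    # Stand-in for a real query; returns a fake geometry string.
    return "POINT(%d %d)" % (feature_id, feature_id)


cache = GeometryCache(slow_fetch)
for _ in range(2):                 # two "zoom-all" passes over 3 features
    for fid in (1, 2, 3):
        cache.get(fid)
print(cache.hits, cache.misses)
```

The second pass is served entirely from memory. The invalidate method is there because any real scheme has to deal with edits somehow, and dropping the stale entry is the simplest option.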
