Using Geographic data in ADLA

Category: azure data lake

Question

Shannon Lowder on Wed, 18 Jul 2018 20:27:46


I have weather data in my data lake with coordinates for weather events.  In SQL Server I have geographic information for State, Counties, Townships, Ranges, and properties.  What I need to do is cross-reference the coordinates in the weather data for the shapes for each of the geographic types.

If this were SQL Server, I'd simply use STIntersects() to do my matches.  Since this is a data lake, I'm trying to find the best way to handle 60+ years of weather data.

I found an example online that shows how to pull in the c# assemblies needed to write USQL with Geographic methods. The problem is, you can't use these extended data types with EXTRACT or OUTPUT statements. So I can't simply copy my geographic data to ADLS.  Since there are functions for turning the geographies into text and back again, I attempted to create text files with textual representations of my geographies.

These geographies are either POLYGON or MULTIPOLYGON shapes. Here's the script I used to try and copy this data to ADLS.

@state =
    SELECT *
    FROM EXTERNAL HailResarch.CPISSTGVMSQL_ODS EXECUTE @"
        SELECT 
	        StateCode, StateName, PostalCode, Shape.ToString() AS ShapeText
        FROM cpis_ods.ods.[State]
        WHERE
	        StateName NOT IN ('Unknown','Foreign Country')";
OUTPUT @state
TO "/sandbox/hail_research/StateShapes.csv"
USING Outputters.Csv(outputHeader: true, quoting: true, rowDelimiter: "\n");

Unfortunately, this doesn't work.  I get an error "Value in the SQL column ShapeText at row 0 is too large." Turns out Strings are limited to 128 kilobytes.

Is there an official way to perform this kind of cross-reference in USQL? Any advice will be appreciated!



Replies

MRys on Thu, 19 Jul 2018 23:50:35


Dear Shannon

Unfortunately you will have to do a bit more processing.

I would suggest that you read the geometry shapes into byte[] (that can be up to 4MB) and then converts it into a geometry shape.

Also note that you will have to wrap the UDTs (geometry/geography) into wrapper types that provide a serialization, if you need to flow the data between stages in your script (since it needs to be temporarily serialized).

Best regards

Michael

Shannon Lowder on Mon, 23 Jul 2018 12:27:59


Unfortunately, some of my multipolygon shapes are over 4MB. Alaska, Florida, North Carolina, Louisiana, Maryland, and California are over this limit. Is there any way around the upper limit for binaries in ADLA?

In the meantime, I'm going to see if I can use the County level data.  Hopefully, they're all below the 4MB limit.


MRys on Tue, 24 Jul 2018 21:21:47


Unfortunately, the 4MB limit is currently the max. We are looking into increasing the rowsize to 16MB, so that would give you some additional size. Alternatively, you may want to look at reducing the complexity of the shapes. I think the spatial library has some reduction methods if I remember correctly.

Shannon Lowder on Wed, 25 Jul 2018 21:03:31


I'll look into reducing the complexity.  I did finally get a version of the code to work.  What I found is once the script got to the STIntersects check, the performance just came to a halt.  Maybe I should geocode the rows BEFORE that data gets to the lake?

What's your experience been with this kind of workload in ADLA?