Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

turfpy.measurement.points_within_polygon and turfpy.measurement.boolean_point_in_polygon are very slow #62

Open
zsiegel92 opened this issue Oct 9, 2020 · 18 comments
Assignees
Labels
enhancement New feature or request

Comments

@zsiegel92
Copy link

Hey there,

I noticed that points_within_polygon (and the foreach-point-callback boolean_point_in_polygon) was fairly slow compared to the Shapely package's shapely.geometry.Polygon.contains method.

As a test, I used a featureCollection called points consisting of roughly 5,000 points (in the Columbus, Ohio metro area fyi) and a geojson.Polygon called polygon (compatible with turfpy methods).

Method 1: turfpy.measurement.points_within_polygon

Calling turfpy.measurement.points_within_polygon(points,polygon) takes roughly 26 seconds.

Method 2: shapely.geometry.Polygon.contains

I wrote a method that:

  1. instantiates a shapely.geometry.Polygon called polygon from a geojson.Polygon and also creates a Python list of points of type shapely.geometry.Point,
  2. overwrites the list of points to only contain those where polygon.contains(point) is True,
  3. converts the list of points back to a geojson.featureCollection, which it returns.

This takes roughly 5 seconds.

Conclusion

The two methods return the same number of points; furthermore, I plotted the points on a map in my browser (using Folium), and neither one is doing anything wrong.

Even with all the object creation (literally copying the list of points, taking up additional memory, too), Method 2 (Shapely) took <1/4 the time.

So - in case anyone intends to use this method, perhaps it can be reworked. For now, I'll post my "faster" method here.

from geojson import Polygon, Point, Feature, FeatureCollection
import shapely.geometry
import shapely

def create_point(lon,lat):
	return Feature(geometry=Point((lon,lat)))

def point_to_shapely(point):
	lon,lat = get_lon_lat(point)
	return shapely.geometry.Point(lon,lat)

def get_list_of_shapely_points(featureCollection):
	return [point_to_shapely(point) for point in featureCollection['features']]

def polygon_to_shapely(polygon):
	return shapely.geometry.Polygon(polygon['coordinates'][0])

def shapely_points_to_featureCollection(points):
	geojson_pts = [create_point(point.x,point.y) for point in points]
	return FeatureCollection(geojson_pts)

def get_contained_points_shapelystyle(points,polygon):
	shply_polygon = polygon_to_shapely(polygon)
	shapely_points = [shply_point for shply_point in get_list_of_shapely_points(points) if shply_polygon.contains(shply_point)]
	return shapely_points_to_featureCollection(shapely_points)
@zsiegel92
Copy link
Author

To be fair, my points have no geojson properties, but I still think this is odd and maybe bad.

@sackh
Copy link
Collaborator

sackh commented Oct 9, 2020

Thanks, @zsiegel92 for reporting this. We will surely look into this. Meanwhile, if you want to raise PR then feel free to raise it.
:)

@sackh sackh self-assigned this Oct 12, 2020
@sackh sackh added the enhancement New feature or request label Oct 12, 2020
@sackh
Copy link
Collaborator

sackh commented Oct 19, 2020

@zsiegel92 - Apart from finding Point in single Polygon currently we also support multiple Polygons and MultiPolygons, but yes we will need to improve the performance of it.

@sackh sackh removed their assignment Oct 24, 2020
@zsiegel92
Copy link
Author

@sackh Thanks for this response! I'm glad you're looking into it. When I encounter a MultiPolygon called feature, I just iterate through the coordinate lists in feature.geometry.coordinates, create a Polygon for each list of coordinate, and then use the method above. Aside from the object-copying I mentioned, it's basically an O(1) transformation, so the difference should remain, though I haven't done a timed test. I can't imagine turfpy does anything much different... I can't imagine there's any theoretical MultiPolygon.contains algorithm that is so much more clever that it is worth a 4x slowdown for a Polygon...

@omanges omanges self-assigned this Nov 8, 2020
@omanges
Copy link
Owner

omanges commented Nov 21, 2020

@zsiegel92 can you share the geojson for the points and polygon you test with I have made some progress so to test it I can use that geojson data.

@omanges
Copy link
Owner

omanges commented Nov 21, 2020

@zsiegel92 I have merged the changes and test with some huge data it looks good, made a new release v0.0.5 you might wanna give it a try 😉

@zsiegel92
Copy link
Author

Hi @omanges I look forward to trying out the new implementation! I already have a parallel set of functions that uses this method rather than my workaround, so I should be able to test it out soon.

This will simplify my codebase, as I will not have any reason to import shapely if this method is comparably performant!

@zsiegel92
Copy link
Author

@omanges I just ran a small and a large trial of my current use case and got the following output:

Using turfpy.measurement.points_within_polygon:

595 points:

Ran 15 trials with average time of 2.7013930002848308 seconds!

14772 points

Ran 5 trials with average time of 57.62896718978882 seconds!

Using Shapely workaround:

595 points:

Ran 15 trials with average time of 1.3027364571889242 seconds!

14772 points

Ran 5 trials with average time of 31.952202796936035 seconds!

@zsiegel92
Copy link
Author

So: it looks like your update improved the performance of this method by a lot! It's still a bit slower than using the shapely method, for some reason.

I noticed when I updated turfpy, pip mentioned downloading shapely as a dependency...was this always a dependency, or did your update utilize Shapely?

@omanges
Copy link
Owner

omanges commented Nov 27, 2020

@zsiegel92 Actually, we had added shapely few version earlier, but it is used for some other functions, I have one question did you increased the chunk size parameter, I bet it will give you more after results than shapely :) Give it a try.

@omanges
Copy link
Owner

omanges commented Nov 27, 2020

@zsiegel92 By default the chunk_size is 1, but can you try by setting it to 9?
Following is the new signature of the function.

def points_within_polygon(
    points: Union[Feature, FeatureCollection],
    polygons: Union[Feature, FeatureCollection],
    chunk_size: int = 1,
) -> FeatureCollection:

@zsiegel92
Copy link
Author

Definitely an improvement as I increase chunk size! I increased it between 1 and 9, and results improved. What is the maximum chunk size at which you expect to see an improvement?

param={'shapelyStyle': True, 'chunk_size': 1}
Iteration 1/5 Iteration 2/5 Iteration 3/5 Iteration 4/5 Iteration 5/5 Average for {'shapelyStyle': True, 'chunk_size': 1}: 7.476626777648926
param={'shapelyStyle': False, 'chunk_size': 1}
Iteration 1/5 Iteration 2/5 Iteration 3/5 Iteration 4/5 Iteration 5/5 Average for {'shapelyStyle': False, 'chunk_size': 1}: 23.191674423217773
param={'shapelyStyle': False, 'chunk_size': 2}
Iteration 1/5 Iteration 2/5 Iteration 3/5 Iteration 4/5 Iteration 5/5 Average for {'shapelyStyle': False, 'chunk_size': 2}: 19.63445348739624
param={'shapelyStyle': False, 'chunk_size': 3}
Iteration 1/5 Iteration 2/5 Iteration 3/5 Iteration 4/5 Iteration 5/5 Average for {'shapelyStyle': False, 'chunk_size': 3}: 15.31570062637329
param={'shapelyStyle': False, 'chunk_size': 4}
Iteration 1/5 Iteration 2/5 Iteration 3/5 Iteration 4/5 Iteration 5/5 Average for {'shapelyStyle': False, 'chunk_size': 4}: 14.534931993484497
param={'shapelyStyle': False, 'chunk_size': 5}
Iteration 1/5 Iteration 2/5 Iteration 3/5 Iteration 4/5 Iteration 5/5 Average for {'shapelyStyle': False, 'chunk_size': 5}: 13.989737749099731
param={'shapelyStyle': False, 'chunk_size': 6}
Iteration 1/5 Iteration 2/5 Iteration 3/5 Iteration 4/5 Iteration 5/5 Average for {'shapelyStyle': False, 'chunk_size': 6}: 13.749914312362671
param={'shapelyStyle': False, 'chunk_size': 7}
Iteration 1/5 Iteration 2/5 Iteration 3/5 Iteration 4/5 Iteration 5/5 Average for {'shapelyStyle': False, 'chunk_size': 7}: 13.314592599868774
param={'shapelyStyle': False, 'chunk_size': 8}
Iteration 1/5 Iteration 2/5 Iteration 3/5 Iteration 4/5 Iteration 5/5 Average for {'shapelyStyle': False, 'chunk_size': 8}: 13.24393539428711
param={'shapelyStyle': False, 'chunk_size': 9}
Iteration 1/5 Iteration 2/5 Iteration 3/5 Iteration 4/5 Iteration 5/5 Average for {'shapelyStyle': False, 'chunk_size': 9}: 13.031423473358155

@zsiegel92
Copy link
Author

Sorry that's tough to understand - the first test was using the Shapely variant I wrote, with average time 7.48s (chunk_size is written as 1 but is not meaningful for this trial). The subsequent trials used the Turfpy function; with chunk_size 1, the time was 23.19s, and it decreased all the way down to 13.03s with chunk_size 9.

@zsiegel92
Copy link
Author

@omanges I ran a few trials overnight and noticed a roughly constant performance for chunk_sizes between 30 and 150. It leveled out at taking around 1.5x the time that Shapely took. Without knowing what this means, it's hard to know what the ideal value should be.

@omanges
Copy link
Owner

omanges commented Nov 30, 2020

@zsiegel92 the chunk_size parameter means same as the chunk_size present in multiprocessor.map please refer to this document https://docs.python.org/release/2.6.6/library/multiprocessing.html#multiprocessing.pool.multiprocessing.Pool.map

@zsiegel92
Copy link
Author

@omanges Thanks! Now I've tested for chunk_sizes of 100, 500, 1000, 10000, and 15000, all with similar performance (12-16s per trial, with <7s for the Shapely version).

@omanges
Copy link
Owner

omanges commented Dec 2, 2020

@zsiegel92 I think further improvement can be done by improving the turfpy.measurement.boolean_point_in_polygon, and then it works faster than shapely as well.

@omanges
Copy link
Owner

omanges commented Dec 2, 2020

And really Thank You !!! for doing such a great investigation :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants