forked from camerondavison/util
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
initial commit of urlparsing project with QueryStringParserCallBack a…
…nd other util methods. Added benchmarking code for urlparam parsing, and number parsing utils
- Loading branch information
Preetha Appan
authored and
Jack Humphrey
committed
Jan 29, 2014
1 parent
85665bb
commit e3c5e18
Showing
23 changed files
with
2,001,403 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,2 +1,4 @@ | ||
*.iml | ||
.*.sw? | ||
*/target/ | ||
urlparsing/src/test/resources/logentries.txt.gz |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,139 @@ | ||
util-urlparsing | ||
|
||
## About | ||
|
||
`urlparsing` is a set of classes for efficiently parsing key value params from query strings in URLs without unnecessary intermediate object creation. It also includes number parsing methods in `ParseUtils` that can parse floats, ints and longs from a query string. Benchmarks are in the test folder along with sample data. | ||
|
||
## Motivation | ||
|
||
Java versions 1.6 and lower have a [significant flaw](http://stackoverflow.com/questions/1281549/memory-leak-traps-in-the-java-standard-api/1281569#1281569) that leads to inefficient memory usage when using the `String.substring` method. We process large strings containing key value pairs of log event data, where we only need much smaller substrings from it. The flaw in the older JVM versions is that they keep the larger char[] array around even after no reference to the original String exists, which led to unnecessary out of memory errors in our log processing code. `QueryStringParser` was written to solve this problem. | ||
Note that this issue has been addressed in the newest versions of Java 1.7. | ||
|
||
Furthermore, our query parsing benchmark shows nearly 4X speedup over a naive Java implementation using `String.split` under significant heap space constraints. It can parse million key-value pairs in under 3 seconds given a max heap of only `-Xmx64M`. Our number parsing benchmark shows over 2X speedup compared to equivalent methods like `Integer.parseInt` and `Float.parseFloat` | ||
|
||
## Usage | ||
|
||
The class `QueryStringParser` has a parse method that accepts a callback interface`QueryStringParserCallBack<T>`. To parse any query string, you'll need to implement that interface. When the parse method finds a key, it uses the callback to provide start and end offsets in the string for the key and value. Implementors of this interface can then set the parsed value in the provided `<T>`storage class. | ||
|
||
Multiple callbacks can be chained together using `QueryStringParserCallbackBuilder`. The advantage of using a callback pattern here is that you can parse just the keys you are interested in from a longer query string. | ||
|
||
For example: | ||
|
||
```java | ||
public class Foo { | ||
String stringValue; | ||
int intValue; | ||
} | ||
|
||
private static final QueryStringParserCallback<Foo> stringValueParser = new QueryStringParserCallback<Foo>() { | ||
@Override | ||
public void parseKeyValuePair(String urlParams, int keyStart, int keyEnd, int valueStart, int valueEnd, Foo storage) { | ||
storage.stringValue = urlParams.substring(valueStart, valueEnd); | ||
} | ||
}; | ||
|
||
private static final QueryStringParserCallback<Foo> intValueParser = new QueryStringParserCallback<Foo>() { | ||
@Override | ||
public void parseKeyValuePair(String urlParams, int keyStart, int keyEnd, int valueStart, int valueEnd, Foo storage) { | ||
storage.intValue = ParseUtils.parseUnsignedInt(urlParams, valueStart, valueEnd); | ||
} | ||
}; | ||
|
||
public void parse(String logentry) { | ||
QueryStringParserCallbackBuilder<Storage> builder = new QueryStringParserCallbackBuilder<Foo>(); | ||
builder.addCallback("stringKey", stringValueParser); | ||
builder.addCallback("intKey", intValueParser); | ||
|
||
final String queryString = "a=x&b=y&foo=bar&stringKey=hello&intKey=111&foobar=1"; | ||
|
||
final QueryStringParserCallback<Storage> queryStringParser = builder.buildCallback(); | ||
final Foo foo = new Foo(); | ||
QueryStringParser.parseQueryString(queryString, queryStringParser, foo); | ||
assert foo.intValue == 111; | ||
assert foo.stringValue.equals("hello"); | ||
|
||
} | ||
.. | ||
``` | ||
|
||
In the above parse method, foo.stringValue will be set to "hello" and storage.intValue will be set to 111. Note that the rest of the keys are essentially ignored because we only added callbacks for two of them. | ||
|
||
## ParseUtils | ||
ParseUtils includes static utility methods to parse integers, longs and floating points from strings efficiently. It also includes a method to url-decode strings. All these methods avoid intermediate string object creation when parsing numbers from strings. Use them inside the query parser callback described above. The following examples illustrate this. | ||
|
||
This example parses an integer inside a callback registered for the "userid" key using `ParseUtils.parseInt`. It avoids an intermediate object created by `queryString.substring(valueStart, valueEnd)`, which is unavoidable if using `Integer.parseInt(s)` to parse instead. | ||
|
||
```java | ||
|
||
QueryStringParserCallbackBuilder<SomeObject> builder = new QueryStringParserCallbackBuilder<SomeObject>(); | ||
builder.addCallback("userid", new QueryStringParserCallback<SomeObject>() { | ||
@Override | ||
public void parseKeyValuePair(String queryString, int keyStart, int keyEnd, int valueStart, int valueEnd, SomeObject storage) { | ||
final int userId = ParseUtils.parseInt(queryString, valueStart, valueEnd); | ||
storage.setUserId(userId); | ||
} | ||
}); | ||
|
||
QueryStringParser.parseQueryString(s, builder.buildCallback(), foo); | ||
|
||
``` | ||
|
||
|
||
This example url decodes the string value in "q" | ||
|
||
```java | ||
|
||
String s = "userid=12345&foo=bar&yo=lo&q=hello+world"; | ||
... | ||
QueryStringParserCallbackBuilder<SomeObject> builder = new QueryStringParserCallbackBuilder<SomeObject>(); | ||
builder.addCallback("q", new QueryStringParserCallback<SomeObject>() { | ||
@Override | ||
public void parseKeyValuePair(String queryString, int keyStart, int keyEnd, int valueStart, int valueEnd, SomeObject storage) { | ||
final StringBuilder urlDecodedQuery = new StringBuilder(); | ||
ParseUtils.urlDecodeInto(queryString, valueStart, valueEnd, urlDecodedQuery ); | ||
assert urlDecodedQuery.equals("hello world"); | ||
} | ||
}); | ||
|
||
QueryStringParser.parseQueryString(s, builder.buildCallback() , foo ); | ||
|
||
``` | ||
|
||
Benchmarks | ||
------------ | ||
Benchmarks comparing the runtime speed and gc stats are in src/test. `KeyValueParsingBenchmark` runs the parsing code over a million query strings and prints time and gc stats. Use `runKeyValueParsingBenchmark.sh` to run it. When it is run with the argument "Indeed" it uses `IndeedKeyValueParser` to parse those query strings. When no argument is given it uses `StringSplitKeyValueParser` which implements parsing with Java's `String.split` and `UrlDecoder.decode`. Our benchmark shows that `IndeedKeyValueParser` is about 4X faster than `StringSplitKeyValueParser` | ||
|
||
``` | ||
./runKeyValueParsingBenchmark.sh Indeed | ||
``` | ||
and | ||
``` | ||
./runKeyValueParsingBenchMark.sh | ||
``` | ||
We have also included `runNumParsingBenchmark.sh` for benchmarking the numeric parsing utilities in ParseUtils. It compares the `parseInt` and `parseFloat` methods in `ParseUtils` to Java's `Integer.parseInt` and `Float.parseFloat`. Usage is similar to the key value parsing benchmark. | ||
|
||
``` | ||
./runNumParsingBenchmark.sh Indeed | ||
``` | ||
and | ||
``` | ||
./runNumParsingBenchmark.sh | ||
``` | ||
JVM options are set inside these scripts at the top, like `export MAVEN_OPTS='-Xmx64M -XX:+PrintGCDetails -verbose:gc'` | ||
|
||
## Custom delimiters | ||
`QueryStringParser` also has a parse method that accepts custom delimiters, instead of the default "&" and "=". For example if you had data like: | ||
|
||
``` | ||
foo:bar%rad:boo%baz:quz | ||
``` | ||
you could parse it using: | ||
``` | ||
QueryStringParser.parseQueryString(queryString, queryStringParser, foo, "%", ":"); | ||
``` | ||
## Dependencies | ||
|
||
- guava (15 ok) | ||
- log4j | ||
- it.unimi.dsi's fastutil | ||
- junit-dep (4.X) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,44 @@ | ||
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd"> | ||
<modelVersion>4.0.0</modelVersion> | ||
|
||
<parent> | ||
<groupId>com.indeed</groupId> | ||
<artifactId>util-parent</artifactId> | ||
<version>1.0.4-SNAPSHOT</version> | ||
</parent> | ||
|
||
<artifactId>util-urlparsing</artifactId> | ||
<name>urlparsing</name> | ||
<description> | ||
Utility methods to effeciently parse url param key value pairs | ||
</description> | ||
|
||
<scm> <!-- prevent Maven from trying to override with subproject suffix --> | ||
<url>${project.parent.scm.url}</url> | ||
<connection>${project.parent.scm.connection}</connection> | ||
<developerConnection>${project.parent.scm.developerConnection}</developerConnection> | ||
</scm> | ||
|
||
<dependencies> | ||
<dependency> | ||
<groupId>log4j</groupId> | ||
<artifactId>log4j</artifactId> | ||
</dependency> | ||
<dependency> | ||
<groupId>it.unimi.dsi</groupId> | ||
<artifactId>fastutil</artifactId> | ||
<version>6.2.2</version> | ||
</dependency> | ||
<dependency> | ||
<groupId>com.google.guava</groupId> | ||
<artifactId>guava</artifactId> | ||
</dependency> | ||
<dependency> | ||
<groupId>junit</groupId> | ||
<artifactId>junit-dep</artifactId> | ||
<scope>test</scope> | ||
</dependency> | ||
|
||
</dependencies> | ||
</project> | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
#!/bin/bash | ||
|
||
export MAVEN_OPTS="-Xmx64M -XX:+PrintGCDetails -verbose:gc" | ||
|
||
if [[ "Indeed" == $1 ]] | ||
then | ||
INDEED_ARG="-Dexec.arguments=ind"; | ||
fi | ||
|
||
if [ ! -f src/test/resources/logentries.txt.gz ]; then | ||
echo "Downloading benchmark data from AWS, this could take a while" | ||
wget -P src/test/resources 'https://s3.amazonaws.com/indeed-open-source/logentries.txt.gz' | ||
fi | ||
|
||
mvn clean package | ||
mvn exec:java -Dexec.mainClass="com.indeed.util.urlparsing.benchmark.KeyValueParsingBenchmark" -Dexec.classpathScope="test" $INDEED_ARG | ||
|
||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
#!/bin/bash | ||
|
||
export MAVEN_OPTS="-Xmx64M -XX:+PrintGCDetails" | ||
|
||
if [[ "Indeed" == $1 ]] | ||
then | ||
INDEED_ARG="-Dexec.arguments=ind"; | ||
fi | ||
mvn clean package | ||
|
||
mvn exec:java -Dexec.mainClass="com.indeed.util.urlparsing.benchmark.NumberParsingBenchmark" -Dexec.classpathScope="test" $INDEED_ARG | ||
|
Oops, something went wrong.