initial commit of urlparsing project with QueryStringParserCallBack and other util methods. Added benchmarking code for urlparam parsing, and number parsing utils
Preetha Appan authored and Jack Humphrey committed Jan 29, 2014
1 parent 85665bb commit e3c5e18
Showing 23 changed files with 2,001,403 additions and 0 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,2 +1,4 @@
*.iml
.*.sw?
*/target/
urlparsing/src/test/resources/logentries.txt.gz
4 changes: 4 additions & 0 deletions README.md
@@ -8,6 +8,10 @@ General Java utilities and helper classes

Utility that enables you to expose runtime variables from a running Java application

## [util-urlparsing](https://github.com/indeedeng/util/tree/master/urlparsing)

Utility to efficiently parse key value pairs from query strings in URLs. Also includes fast number parsing and url decoding utilities.

# License

[Apache License Version 2.0](https://github.com/indeedeng/util/blob/master/LICENSE)
1 change: 1 addition & 0 deletions pom.xml
@@ -29,5 +29,6 @@
<modules>
<module>varexport</module>
<module>util-core</module>
<module>urlparsing</module>
</modules>
</project>
139 changes: 139 additions & 0 deletions urlparsing/README.md
@@ -0,0 +1,139 @@
# util-urlparsing

## About

`urlparsing` is a set of classes for efficiently parsing key value params from query strings in URLs without unnecessary intermediate object creation. It also includes number parsing methods in `ParseUtils` that can parse floats, ints and longs from a query string. Benchmarks are in the test folder along with sample data.

## Motivation

Java versions 1.6 and lower have a [significant flaw](http://stackoverflow.com/questions/1281549/memory-leak-traps-in-the-java-standard-api/1281569#1281569) that leads to inefficient memory usage when using the `String.substring` method. We process large strings containing key-value pairs of log event data, from which we need only much smaller substrings. In those older versions, a substring shares the backing `char[]` of the string it was cut from, so the larger array stays in memory even after no reference to the original `String` remains. This led to unnecessary out-of-memory errors in our log processing code. `QueryStringParser` was written to solve this problem.
Note that this issue has been addressed in newer versions of Java 1.7 (update 6 and later).

Furthermore, our query parsing benchmark shows a nearly 4X speedup over a naive Java implementation using `String.split` under significant heap space constraints. It can parse a million key-value pairs in under 3 seconds given a max heap of only 64 MB (`-Xmx64M`). Our number parsing benchmark shows over 2X speedup compared to equivalent methods like `Integer.parseInt` and `Float.parseFloat`.

## Usage

The class `QueryStringParser` has a parse method that accepts a callback interface, `QueryStringParserCallback<T>`. To parse any query string, you'll need to implement that interface. When the parse method finds a key-value pair, it invokes the callback with the start and end offsets of the key and the value in the string. Implementors of this interface can then set the parsed value in the provided storage object of type `T`.

Multiple callbacks can be chained together using `QueryStringParserCallbackBuilder`. The advantage of using a callback pattern here is that you can parse just the keys you are interested in from a longer query string.

For example:

```java
public class Foo {
    String stringValue;
    int intValue;
}

// Stores the raw value of the matched key as a String.
private static final QueryStringParserCallback<Foo> stringValueParser = new QueryStringParserCallback<Foo>() {
    @Override
    public void parseKeyValuePair(String urlParams, int keyStart, int keyEnd, int valueStart, int valueEnd, Foo storage) {
        storage.stringValue = urlParams.substring(valueStart, valueEnd);
    }
};

// Parses the value of the matched key as an unsigned int.
private static final QueryStringParserCallback<Foo> intValueParser = new QueryStringParserCallback<Foo>() {
    @Override
    public void parseKeyValuePair(String urlParams, int keyStart, int keyEnd, int valueStart, int valueEnd, Foo storage) {
        storage.intValue = ParseUtils.parseUnsignedInt(urlParams, valueStart, valueEnd);
    }
};

public void parse(String logentry) {
    QueryStringParserCallbackBuilder<Foo> builder = new QueryStringParserCallbackBuilder<Foo>();
    builder.addCallback("stringKey", stringValueParser);
    builder.addCallback("intKey", intValueParser);

    final String queryString = "a=x&b=y&foo=bar&stringKey=hello&intKey=111&foobar=1";

    final QueryStringParserCallback<Foo> queryStringParser = builder.buildCallback();
    final Foo foo = new Foo();
    QueryStringParser.parseQueryString(queryString, queryStringParser, foo);
    assert foo.intValue == 111;
    assert foo.stringValue.equals("hello");
}
```

In the above parse method, `foo.stringValue` will be set to "hello" and `foo.intValue` will be set to 111. Note that the rest of the keys are ignored because we only added callbacks for two of them.
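To make the callback pattern more concrete, here is a minimal, self-contained sketch of how offset-based scanning and per-key dispatch could work. It is not the library's implementation: `PairCallback`, `Dispatcher`, `scan`, and `demo` are hypothetical names invented for this illustration, and the scanner assumes the default `&` and `=` delimiters.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CallbackDispatchSketch {

    /** Hypothetical stand-in for the library's callback interface. */
    interface PairCallback<T> {
        void onPair(String s, int keyStart, int keyEnd, int valueStart, int valueEnd, T storage);
    }

    /** Hypothetical dispatcher: forwards a pair to the callback registered for its key. */
    static final class Dispatcher<T> implements PairCallback<T> {
        private final Map<String, PairCallback<T>> byKey = new LinkedHashMap<String, PairCallback<T>>();

        Dispatcher<T> add(String key, PairCallback<T> callback) {
            byKey.put(key, callback);
            return this;
        }

        @Override
        public void onPair(String s, int keyStart, int keyEnd, int valueStart, int valueEnd, T storage) {
            for (Map.Entry<String, PairCallback<T>> e : byKey.entrySet()) {
                String key = e.getKey();
                // regionMatches compares in place, so no substring is created for the key.
                if (key.length() == keyEnd - keyStart && s.regionMatches(keyStart, key, 0, key.length())) {
                    e.getValue().onPair(s, keyStart, keyEnd, valueStart, valueEnd, storage);
                    return;
                }
            }
            // Keys without a registered callback are skipped.
        }
    }

    /** Walks "k=v&k=v" input and reports offsets only; pairs without '=' are skipped. */
    static <T> void scan(String s, PairCallback<T> callback, T storage) {
        int pos = 0;
        while (pos < s.length()) {
            int end = s.indexOf('&', pos);
            if (end < 0) {
                end = s.length();
            }
            int eq = s.indexOf('=', pos);
            if (eq >= 0 && eq < end) {
                callback.onPair(s, pos, eq, eq + 1, end, storage);
            }
            pos = end + 1;
        }
    }

    /** Collects the values of "stringKey" and "intKey" into a map; other keys are ignored. */
    static Map<String, String> demo(String query) {
        Map<String, String> seen = new LinkedHashMap<String, String>();
        PairCallback<Map<String, String>> record = new PairCallback<Map<String, String>>() {
            @Override
            public void onPair(String s, int ks, int ke, int vs, int ve, Map<String, String> out) {
                out.put(s.substring(ks, ke), s.substring(vs, ve));
            }
        };
        Dispatcher<Map<String, String>> dispatcher = new Dispatcher<Map<String, String>>()
                .add("stringKey", record)
                .add("intKey", record);
        scan(query, dispatcher, seen);
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(demo("a=x&b=y&foo=bar&stringKey=hello&intKey=111&foobar=1"));
    }
}
```

The point of the design is that substrings are created only for the keys a caller registered; everything else is skipped by comparing character ranges in place.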

## ParseUtils
`ParseUtils` includes static utility methods to efficiently parse integers, longs and floats from strings. It also includes a method to URL-decode strings. All of these methods avoid creating intermediate string objects. Use them inside the query parser callback described above. The following examples illustrate this.

This example parses an integer inside a callback registered for the "userid" key using `ParseUtils.parseInt`. It avoids the intermediate object created by `queryString.substring(valueStart, valueEnd)`, which is unavoidable when parsing with `Integer.parseInt(s)` instead.

```java
// `s` is the query string and `foo` is the SomeObject that receives the parsed values.
QueryStringParserCallbackBuilder<SomeObject> builder = new QueryStringParserCallbackBuilder<SomeObject>();
builder.addCallback("userid", new QueryStringParserCallback<SomeObject>() {
    @Override
    public void parseKeyValuePair(String queryString, int keyStart, int keyEnd, int valueStart, int valueEnd, SomeObject storage) {
        final int userId = ParseUtils.parseInt(queryString, valueStart, valueEnd);
        storage.setUserId(userId);
    }
});

QueryStringParser.parseQueryString(s, builder.buildCallback(), foo);
```
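As a rough illustration of how an integer can be read from a character range without a substring, consider the following sketch. It is an assumption about the general technique, not the source of `ParseUtils.parseInt`; the name `parseIntRange` is hypothetical, and overflow checking is deliberately left out to keep it short.

```java
public class RangeIntParseSketch {

    /**
     * Parses a base-10 int from s[start, end) without creating a substring.
     * Accepts an optional leading '-'. Throws NumberFormatException on an
     * empty range or a non-digit character. Overflow is NOT checked.
     */
    static int parseIntRange(String s, int start, int end) {
        if (start >= end) {
            throw new NumberFormatException("empty range");
        }
        int i = start;
        boolean negative = s.charAt(i) == '-';
        if (negative) {
            i++;
        }
        if (i == end) {
            throw new NumberFormatException("no digits");
        }
        int result = 0;
        for (; i < end; i++) {
            int digit = s.charAt(i) - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at index " + i);
            }
            result = result * 10 + digit;
        }
        return negative ? -result : result;
    }

    public static void main(String[] args) {
        String query = "userid=12345&foo=bar";
        int valueStart = query.indexOf('=') + 1;
        int valueEnd = query.indexOf('&');
        System.out.println(parseIntRange(query, valueStart, valueEnd));
    }
}
```

Because the digits are accumulated directly from `charAt`, the only allocation on the success path is whatever the caller does with the resulting `int`.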


This example URL-decodes the string value of the "q" key:

```java
String s = "userid=12345&foo=bar&yo=lo&q=hello+world";
...
QueryStringParserCallbackBuilder<SomeObject> builder = new QueryStringParserCallbackBuilder<SomeObject>();
builder.addCallback("q", new QueryStringParserCallback<SomeObject>() {
    @Override
    public void parseKeyValuePair(String queryString, int keyStart, int keyEnd, int valueStart, int valueEnd, SomeObject storage) {
        final StringBuilder urlDecodedQuery = new StringBuilder();
        ParseUtils.urlDecodeInto(queryString, valueStart, valueEnd, urlDecodedQuery);
        // StringBuilder does not override equals, so compare its String contents.
        assert urlDecodedQuery.toString().equals("hello world");
    }
});

QueryStringParser.parseQueryString(s, builder.buildCallback(), foo);
```
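For readers curious about what decoding a range into a `StringBuilder` involves, here is a simplified sketch. It is not the `ParseUtils.urlDecodeInto` source: `decodeInto` is a hypothetical name, and the sketch only handles `+` and single-byte `%XX` escapes (each byte is appended as one char, so multi-byte UTF-8 sequences are not reassembled).

```java
public class UrlDecodeSketch {

    /**
     * Appends the decoded form of s[start, end) to out.
     * '+' becomes a space and "%XX" becomes the char with that byte value.
     * A malformed or truncated escape is copied through unchanged.
     */
    static void decodeInto(String s, int start, int end, StringBuilder out) {
        int i = start;
        while (i < end) {
            char c = s.charAt(i);
            if (c == '+') {
                out.append(' ');
                i++;
            } else if (c == '%' && i + 2 < end) {
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.append((char) ((hi << 4) | lo));
                    i += 3;
                } else {
                    out.append(c); // malformed escape: keep the '%' literally
                    i++;
                }
            } else {
                out.append(c);
                i++;
            }
        }
    }

    public static void main(String[] args) {
        String s = "userid=12345&foo=bar&yo=lo&q=hello+world";
        int valueStart = s.indexOf("q=") + 2;
        StringBuilder decoded = new StringBuilder();
        decodeInto(s, valueStart, s.length(), decoded);
        System.out.println(decoded);
    }
}
```

Reusing one `StringBuilder` across calls (after `setLength(0)`) would avoid even the per-value buffer allocation.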

## Benchmarks
Benchmarks comparing the runtime speed and GC stats are in `src/test`. `KeyValueParsingBenchmark` runs the parsing code over a million query strings and prints time and GC stats. Use `runKeyValueParsingBenchmark.sh` to run it. When it is run with the argument "Indeed" it uses `IndeedKeyValueParser` to parse those query strings. When no argument is given it uses `StringSplitKeyValueParser`, which implements parsing with Java's `String.split` and `URLDecoder.decode`. Our benchmark shows that `IndeedKeyValueParser` is about 4X faster than `StringSplitKeyValueParser`.

```
./runKeyValueParsingBenchmark.sh Indeed
```
and
```
./runKeyValueParsingBenchmark.sh
```
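For context, a naive baseline built on `String.split` and `URLDecoder.decode` might look roughly like the sketch below. This is an assumption about the general shape of such a parser, not the actual `StringSplitKeyValueParser` source; the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class NaiveSplitParserSketch {

    /** Parses "k=v&k=v" into a map, URL-decoding each value. Pairs without '=' are skipped. */
    static Map<String, String> parse(String query) {
        Map<String, String> result = new LinkedHashMap<String, String>();
        for (String pair : query.split("&")) {   // allocates one String per pair
            String[] kv = pair.split("=", 2);     // allocates an array plus key and value Strings
            if (kv.length != 2) {
                continue;
            }
            try {
                result.put(kv[0], URLDecoder.decode(kv[1], "UTF-8"));
            } catch (UnsupportedEncodingException e) {
                // UTF-8 is a required charset on every Java platform.
                throw new IllegalStateException(e);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("userid=12345&foo=bar&yo=lo&q=hello+world"));
    }
}
```

Every pair produces several short-lived objects here, whether or not the caller cares about that key, which is the kind of allocation pressure the GC stats in the benchmark are meant to expose.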
We have also included `runNumParsingBenchmark.sh` for benchmarking the numeric parsing utilities in ParseUtils. It compares the `parseInt` and `parseFloat` methods in `ParseUtils` to Java's `Integer.parseInt` and `Float.parseFloat`. Usage is similar to the key value parsing benchmark.

```
./runNumParsingBenchmark.sh Indeed
```
and
```
./runNumParsingBenchmark.sh
```
JVM options are set at the top of these scripts, for example `export MAVEN_OPTS="-Xmx64M -XX:+PrintGCDetails -verbose:gc"`.

## Custom delimiters
`QueryStringParser` also has a parse method that accepts custom delimiters instead of the default "&" and "=". For example, if you had data like:

```
foo:bar%rad:boo%baz:quz
```
you could parse it using:
```
QueryStringParser.parseQueryString(queryString, queryStringParser, foo, "%", ":");
```
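The idea of delimiter-agnostic scanning can be sketched independently of the library. The example below is illustrative only: `pairOffsets` is a hypothetical helper that returns the offsets of each pair for arbitrary non-empty delimiters, and it makes no claim about how `QueryStringParser` handles delimiters internally.

```java
import java.util.ArrayList;
import java.util.List;

public class CustomDelimiterSketch {

    /**
     * Returns {keyStart, keyEnd, valueStart, valueEnd} for each pair in s.
     * Segments that do not contain kvDelim are skipped.
     */
    static List<int[]> pairOffsets(String s, String pairDelim, String kvDelim) {
        if (pairDelim.isEmpty() || kvDelim.isEmpty()) {
            throw new IllegalArgumentException("delimiters must be non-empty");
        }
        List<int[]> offsets = new ArrayList<int[]>();
        int pos = 0;
        while (pos < s.length()) {
            int end = s.indexOf(pairDelim, pos);
            if (end < 0) {
                end = s.length();
            }
            int sep = s.indexOf(kvDelim, pos);
            // Only accept a separator that lies entirely inside the current segment.
            if (sep >= 0 && sep + kvDelim.length() <= end) {
                offsets.add(new int[] {pos, sep, sep + kvDelim.length(), end});
            }
            pos = end + pairDelim.length();
        }
        return offsets;
    }

    public static void main(String[] args) {
        String data = "foo:bar%rad:boo%baz:quz";
        for (int[] o : pairOffsets(data, "%", ":")) {
            System.out.println(data.substring(o[0], o[1]) + " -> " + data.substring(o[2], o[3]));
        }
    }
}
```

Returning offsets rather than strings leaves it to the caller to decide which slices are worth materializing.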
## Dependencies

- guava (15 ok)
- log4j
- it.unimi.dsi's fastutil
- junit-dep (4.X)
44 changes: 44 additions & 0 deletions urlparsing/pom.xml
@@ -0,0 +1,44 @@
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>

<parent>
<groupId>com.indeed</groupId>
<artifactId>util-parent</artifactId>
<version>1.0.4-SNAPSHOT</version>
</parent>

<artifactId>util-urlparsing</artifactId>
<name>urlparsing</name>
<description>
Utility methods to efficiently parse url param key value pairs
</description>

<scm> <!-- prevent Maven from trying to override with subproject suffix -->
<url>${project.parent.scm.url}</url>
<connection>${project.parent.scm.connection}</connection>
<developerConnection>${project.parent.scm.developerConnection}</developerConnection>
</scm>

<dependencies>
<dependency>
<groupId>log4j</groupId>
<artifactId>log4j</artifactId>
</dependency>
<dependency>
<groupId>it.unimi.dsi</groupId>
<artifactId>fastutil</artifactId>
<version>6.2.2</version>
</dependency>
<dependency>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit-dep</artifactId>
<scope>test</scope>
</dependency>

</dependencies>
</project>

18 changes: 18 additions & 0 deletions urlparsing/runKeyValueParsingBenchmark.sh
@@ -0,0 +1,18 @@
#!/bin/bash

export MAVEN_OPTS="-Xmx64M -XX:+PrintGCDetails -verbose:gc"

if [[ "Indeed" == $1 ]]
then
INDEED_ARG="-Dexec.arguments=ind";
fi

if [ ! -f src/test/resources/logentries.txt.gz ]; then
echo "Downloading benchmark data from AWS, this could take a while"
wget -P src/test/resources 'https://s3.amazonaws.com/indeed-open-source/logentries.txt.gz'
fi

mvn clean package
mvn exec:java -Dexec.mainClass="com.indeed.util.urlparsing.benchmark.KeyValueParsingBenchmark" -Dexec.classpathScope="test" $INDEED_ARG


12 changes: 12 additions & 0 deletions urlparsing/runNumParsingBenchmark.sh
@@ -0,0 +1,12 @@
#!/bin/bash

export MAVEN_OPTS="-Xmx64M -XX:+PrintGCDetails"

if [[ "Indeed" == $1 ]]
then
INDEED_ARG="-Dexec.arguments=ind";
fi
mvn clean package

mvn exec:java -Dexec.mainClass="com.indeed.util.urlparsing.benchmark.NumberParsingBenchmark" -Dexec.classpathScope="test" $INDEED_ARG
