Yauaa: Yet Another UserAgent Analyzer

This is a Java library that tries to parse and analyze the useragent string and extract as many relevant attributes as possible.

The resulting output fields can be classified into several categories:

The Device: The hardware that was used.
The Operating System: The base software that runs on the hardware
The Layout Engine: The underlying core that converts the 'HTML' into a visual/interactive representation.
The Agent: The actual "Browser" that was used.
Extra fields: In some cases we have additional fields to describe the agent. These include, among others, fields specific to the Facebook and Kobo apps and fields that describe deliberate useragent manipulation (Anonymization, Hackers, etc.)

Note that not all fields are always available. So if you look at a specific field you will in general find null values and "Unknown" in there as well.

As few lookup tables as possible are included; the system really tries to analyze the useragent and extract values from it. The aim of this approach is a system that can classify as much traffic as possible yet requires as little maintenance as possible, because all versions, and in many places also the names of the components used, are extracted without knowing them beforehand.

Example output

As an example the useragent of my phone:

Mozilla/5.0 (Linux; Android 7.0; Nexus 6 Build/NBD90Z) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.124 Mobile Safari/537.36

is converted into this set of fields:

Field name	Value
DeviceClass	'Phone'
DeviceName	'Nexus 6'
OperatingSystemClass	'Mobile'
OperatingSystemName	'Android'
OperatingSystemVersion	'7.0'
OperatingSystemNameVersion	'Android 7.0'
OperatingSystemVersionBuild	'NBD90Z'
LayoutEngineClass	'Browser'
LayoutEngineName	'Blink'
LayoutEngineVersion	'53.0'
LayoutEngineVersionMajor	'53'
LayoutEngineNameVersion	'Blink 53.0'
LayoutEngineNameVersionMajor	'Blink 53'
AgentClass	'Browser'
AgentName	'Chrome'
AgentVersion	'53.0.2785.124'
AgentVersionMajor	'53'
AgentNameVersion	'Chrome 53.0.2785.124'
AgentNameVersionMajor	'Chrome 53'

Performance

On my i7 system I see a speed ranging from 500 to 4000 useragents per second (depending on the length of and the ambiguities in the useragent). On average the speed is around 2000 per second or ~0.5ms each. An LRU cache is in place that does over 1M per second for useragents that are already in the cache.

Output from the benchmark (using this code) on an Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz:

Benchmark	Mode	Cnt	Score		Error	Units
AnalyzerBenchmarks.android6Chrome46	avgt	10	0.538	±	0.021	ms/op
AnalyzerBenchmarks.androidPhone	avgt	10	0.688	±	0.025	ms/op
AnalyzerBenchmarks.googleAdsBot	avgt	10	0.111	±	0.002	ms/op
AnalyzerBenchmarks.googleAdsBotMobile	avgt	10	0.385	±	0.021	ms/op
AnalyzerBenchmarks.googleBotMobileAndroid	avgt	10	0.575	±	0.021	ms/op
AnalyzerBenchmarks.googlebot	avgt	10	0.199	±	0.004	ms/op
AnalyzerBenchmarks.hackerSQL	avgt	10	0.096	±	0.003	ms/op
AnalyzerBenchmarks.hackerShellShock	avgt	10	0.084	±	0.002	ms/op
AnalyzerBenchmarks.iPad	avgt	10	0.344	±	0.011	ms/op
AnalyzerBenchmarks.iPhone	avgt	10	0.341	±	0.006	ms/op
AnalyzerBenchmarks.iPhoneFacebookApp	avgt	10	0.695	±	0.026	ms/op
AnalyzerBenchmarks.win10Chrome51	avgt	10	0.307	±	0.011	ms/op
AnalyzerBenchmarks.win10Edge13	avgt	10	0.336	±	0.013	ms/op
AnalyzerBenchmarks.win10IE11	avgt	10	0.282	±	0.008	ms/op
AnalyzerBenchmarks.win7ie11	avgt	10	0.279	±	0.007	ms/op

In the canonical usecase of analysing clickstream data you will see a <1ms hit per visitor (or better: per new non-cached useragent) and for all the other clicks the values are retrieved from this cache at close to 0 time.
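To illustrate why repeated useragents cost close to nothing, here is a minimal sketch of an LRU cache in front of an expensive parse function. It is not the cache this library actually uses; the class name `LruParseCache` and its methods are made up for this example, and the sketch is not threadsafe.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical LRU cache wrapped around an expensive parse function. */
class LruParseCache<V> {
    private final Function<String, V> parser;
    private final Map<String, V> cache;
    private int misses = 0;

    LruParseCache(final int capacity, Function<String, V> parser) {
        this.parser = parser;
        // accessOrder=true keeps the least recently used entry first,
        // so removeEldestEntry evicts exactly that one.
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > capacity;
            }
        };
    }

    /** Returns the cached result, or parses and caches it on a miss. */
    V parse(String userAgent) {
        V result = cache.get(userAgent);
        if (result == null) {
            misses++;
            result = parser.apply(userAgent);
            cache.put(userAgent, result);
        }
        return result;
    }

    /** Number of times the expensive parser really had to run. */
    int getMisses() {
        return misses;
    }
}
```

With clickstream data the number of misses stays close to the number of distinct useragents, as long as the capacity is large enough to hold the ones that are currently active.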

Using the analyzer

In addition to the UDFs for Apache Pig and Platfora (see below) this analyzer can also be used in Java based applications.

First add the library as a dependency to your application. It has been published to Maven Central so this should work in almost any environment.

<dependency>
  <groupId>nl.basjes.parse.useragent</groupId>
  <artifactId>yauaa</artifactId>
  <version>0.12</version>
</dependency>

and in your application you can use it as simply as this:

    UserAgentAnalyzer uaa = new UserAgentAnalyzer();

    UserAgent agent = uaa.parse("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.11) Gecko/20071127 Firefox/2.0.0.11");

    for (String fieldName: agent.getAvailableFieldNamesSorted()) {
        System.out.println(fieldName + " = " + agent.getValue(fieldName));
    }

Note that not all fields are available after every parse. So be prepared to receive a 'null' if you extract a specific name.

IMPORTANT: This library is NOT threadsafe/reentrant! So if you need it in a multi threaded situation you either need to synchronize all access to it or create a separate instance per thread.
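The "separate instance per thread" option can be sketched with a plain `ThreadLocal`. The wrapper class `PerThreadAnalyzer` below is hypothetical and generic so that it compiles without the library on the classpath; in a real application the factory would be something like `UserAgentAnalyzer::new`.

```java
import java.util.function.Supplier;

/** Hypothetical helper that gives every thread its own lazily created instance. */
class PerThreadAnalyzer<A> {
    private final ThreadLocal<A> instances;

    PerThreadAnalyzer(Supplier<A> factory) {
        // The factory runs once per thread, on that thread's first call to get().
        this.instances = ThreadLocal.withInitial(factory);
    }

    /** Returns the instance that belongs to the calling thread. */
    A get() {
        return instances.get();
    }
}
```

Keep in mind that every instance has its own startup cost and memory footprint, so with many threads synchronizing on a single instance may be the better trade-off.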

Limiting to only certain fields

In some scenarios you only want a specific field and all others are unwanted. This can be achieved by creating the analyzer in Java like this:

public class ThreadState {
    // Analyzer that only extracts the two requested fields.
    UserAgentAnalyzer uaa;

    public ThreadState() {
        uaa = UserAgentAnalyzer
                .newBuilder()
                .withoutCache()
                .withField("DeviceClass")
                .withField("AgentNameVersionMajor")
                .build();
    }
}

One important effect is that this speeds up the system because it will drop any rules that do not help in getting the desired fields. The above example showed an approximate 40% reduction in time per useragent (i.e. times dropped from ~1ms to ~0.6ms).
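The idea of dropping rules that cannot contribute to the wanted fields can be sketched as follows. This is only an illustration under simplifying assumptions: the `Rule` class is hypothetical, and the sketch ignores rules that are needed indirectly because one field is derived from another.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class RulePruningSketch {
    /** Hypothetical rule: a name plus the fields it is able to fill. */
    static class Rule {
        final String name;
        final Set<String> producedFields;

        Rule(String name, String... producedFields) {
            this.name = name;
            this.producedFields = new HashSet<>(Arrays.asList(producedFields));
        }
    }

    /** Keeps only the rules that can fill at least one of the wanted fields. */
    static List<Rule> prune(List<Rule> rules, Set<String> wantedFields) {
        List<Rule> kept = new ArrayList<>();
        for (Rule rule : rules) {
            if (!Collections.disjoint(rule.producedFields, wantedFields)) {
                kept.add(rule);
            }
        }
        return kept;
    }
}
```

Fewer rules means fewer patterns to evaluate per useragent, which is where the speedup comes from.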

User Defined Functions

Several external computation systems support the concept of a User Defined Function (UDF). A UDF is simply a way of making functionality (in this case the analysis of useragents) available in such a system.

For two such systems, Apache Pig and Platfora (both are used within bol.com, where I work), we have written such a UDF. Both UDFs are part of this project.

UDFs written by other people:

Apache Drill https://github.com/cgivre/drill-useragent-function

Values explained

DeviceClass

Value	Meaning
Desktop	The device is assessed as a Desktop/Laptop class device
Anonymized	In some cases the useragent has been altered by anonymization software
Unknown	We really don't know, these are usually useragents that look normal yet contain almost no information about the device
Mobile	A device that is mobile yet we do not know if it is an eReader/Tablet/Phone or Watch
Tablet	A mobile device with a rather large screen (common > 7")
Phone	A mobile device with a small screen (common < 7")
Watch	A mobile device with a tiny screen (common < 2"). Normally these are an additional screen for a phone/tablet type device.
Virtual Reality	A mobile device with VR capabilities
eReader	Similar to a Tablet yet in most cases with an eInk screen
Set-top box	A connected device that allows interacting via a TV sized screen
TV	Similar to Set-top box yet here this is built into the TV
Game Console	'Fixed' game systems like the PlayStation and XBox
Handheld Game Console	'Mobile' game systems like the 3DS
Robot	Robots that visit the site
Robot Mobile	Robots that visit the site indicating they want to be seen as a Mobile visitor
Spy	Robots that visit the site pretending to be robots like Google, but they are not
Hacker	In case scripting is detected in the useragent string, also fallback in really broken situations
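A typical use of these values is separating human visitors from everything else. The sketch below does that with the values from the table above; grouping Robot, Robot Mobile, Spy and Hacker together is my own assumption for this example, and the class name `DeviceClassFilter` is hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class DeviceClassFilter {
    // DeviceClass values from the table that indicate robots or manipulation.
    private static final Set<String> ROBOT_OR_HACKER = new HashSet<>(
            Arrays.asList("Robot", "Robot Mobile", "Spy", "Hacker"));

    /** Returns true if the DeviceClass value points at a robot or a hacker. */
    static boolean isRobotOrHacker(String deviceClass) {
        // A field can be null when it was not available after parsing.
        return deviceClass != null && ROBOT_OR_HACKER.contains(deviceClass);
    }
}
```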

Parsing Useragents

Parsing useragents is considered by many to be a ridiculously hard problem. The main problems are:

Although there seems to be a specification, many do not follow it.
Useragents LIE that they are their competing predecessor with an extra flag.

The pattern the 'normal' browser builders are following is that they all LIE about the ancestor they are trying to improve upon.

The reason this system (historically) works is because a lot of website builders do a very simple check to see if they can use a specific feature.

if (useragent.contains("Chrome")) {
   // Use the chrome feature we need.
}

Some may improve on this and actually check the (major) version that follows.

A good example of this is the Edge browser:

Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.135 Safari/537.36 Edge/12.10136

It says it:

is Mozilla/5.0
uses AppleWebKit/537.36
for "compatibility" the AppleWebKit lies about being "KHTML" and about being similar to "Gecko" are also copied
is Chrome 42
is Safari 537
is Edge 12

So any website looking for the word it triggers upon will find it and enable the right features.
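A small sketch makes this concrete: the simple substring check from above succeeds on the Edge useragent for every browser it claims to be. The class name `NaiveSniffing` is made up for this example.

```java
class NaiveSniffing {
    static final String EDGE_UA =
            "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) "
            + "Chrome/42.0.2311.135 Safari/537.36 Edge/12.10136";

    /** The simple substring check many websites use. */
    static boolean claims(String userAgent, String product) {
        return userAgent.contains(product);
    }

    public static void main(String[] args) {
        for (String product : new String[] {"Mozilla", "Gecko", "Chrome", "Safari", "Edge", "Firefox"}) {
            System.out.println(product + " -> " + claims(EDGE_UA, product));
        }
    }
}
```

Only the check for "Firefox" fails here; all the others report true.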

How many other analyzers work

Most implementations I have seen analyse useragents with lists of regular expressions. These are executed in a specific order to find the first one that matches.

With this kind of solution the order in which the expressions are tried determines the outcome.

Regular expressions are notoriously hard to write and debug and (unless you make them really complex) the order in which parts of the pattern occur is fixed.
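To show how fragile the ordering is, here is a minimal sketch of such a "first match wins" list. It does not reproduce any specific analyzer; the class `FirstMatchWins`, the rule names and the patterns are made up for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

class FirstMatchWins {
    /** Builds an ordered rule list from (name, regex) pairs. */
    static Map<String, Pattern> rules(String... nameRegexPairs) {
        Map<String, Pattern> ordered = new LinkedHashMap<>();
        for (int i = 0; i + 1 < nameRegexPairs.length; i += 2) {
            ordered.put(nameRegexPairs[i], Pattern.compile(nameRegexPairs[i + 1]));
        }
        return ordered;
    }

    /** Returns the name of the first rule whose pattern occurs in the useragent. */
    static String classify(Map<String, Pattern> orderedRules, String userAgent) {
        for (Map.Entry<String, Pattern> rule : orderedRules.entrySet()) {
            if (rule.getValue().matcher(userAgent).find()) {
                return rule.getKey();
            }
        }
        return "Unknown";
    }
}
```

With the Edge useragent from above, a list that tries `Chrome/\d+` before `Edge/\d+` answers "Chrome", and the same two rules in the opposite order answer "Edge".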

Core design idea

I wanted to see if a completely different approach would work: Can we actually parse these things into a tree and work from there?

The parser (ANTLR4 based) will be able to parse a lot of the agents but not all. Tests have shown that it will parse >99% of all useragents on a large website, which covers more than 99.99% of the traffic.

Now the ones that it is not able to parse are the ones that have been set manually to an invalid value. So if that happens we assume you are a hacker. In all other cases we have matchers that are triggered if a specific value is found by the parser. Such a matcher then tells the analyzer it has found a match for a certain attribute with a certain confidence level (0-10000). In the end the matcher that has found a match with the highest confidence for a value 'wins'.
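The "highest confidence wins" rule can be sketched like this. It is a simplified illustration and not the actual implementation; the class `ConfidenceMerge`, the confidence numbers in the usage note and the choice to keep the first value on a tie are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

class ConfidenceMerge {
    /** A value a matcher proposed, together with its confidence (0-10000). */
    private static class Candidate {
        final String value;
        final long confidence;

        Candidate(String value, long confidence) {
            this.value = value;
            this.confidence = confidence;
        }
    }

    private final Map<String, Candidate> best = new HashMap<>();

    /** Called by a matcher; only a strictly higher confidence replaces the current value. */
    void offer(String field, String value, long confidence) {
        Candidate current = best.get(field);
        if (current == null || confidence > current.confidence) {
            best.put(field, new Candidate(value, confidence));
        }
    }

    /** Returns the winning value, or null if no matcher offered one for this field. */
    String getValue(String field) {
        Candidate winner = best.get(field);
        return winner == null ? null : winner.value;
    }
}
```

For the Edge example, matchers could offer `AgentName` as "Safari" with 200, "Chrome" with 1000 and "Edge" with 5000; in any order the result is "Edge".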

High level implementation overview

The main concept of this useragent parser is that we have two things:

A Parser (ANTLR4) that converts the useragent into a nice tree through which we can walk along.
A collection of matchers.

A matcher triggers if a set of patterns is present in the tree.
Each pattern is detected by a "matcher action" that triggers and can fill a single attribute. If a matcher triggers, a set of attributes gets set, each with a value and a confidence level.
All results from all triggered matchers (and actions) are combined and for each individual attribute the value with the highest confidence wins.

As a performance optimization we walk along the parsed tree once and fire everything we find into a precomputed hashmap that points to all the applicable matcher actions. As a consequence:

the matching is relatively fast even though the number of matchers already runs into the few hundreds.
the startup is "slow"
the memory footprint is pretty big due to the number of matchers, the size of the hashmap and the cache of the parsed useragents.
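The lookup idea behind this optimization can be sketched as follows. This is a strongly simplified illustration: the class `DispatchSketch` is hypothetical, the tree is reduced to (path, value) pairs, and path names such as "product.name" are made up.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

class DispatchSketch {
    // Filled once at startup: tree path -> all actions interested in that path.
    private final Map<String, List<Consumer<String>>> actionsByPath = new HashMap<>();

    /** Registers an action for a path; this is the "slow" startup part. */
    void register(String path, Consumer<String> action) {
        actionsByPath.computeIfAbsent(path, k -> new ArrayList<>()).add(action);
    }

    /**
     * Called once per node while walking the parsed tree.
     * A single hashmap lookup finds every applicable action.
     * Returns the number of actions that were fired.
     */
    int visit(String path, String value) {
        List<Consumer<String>> actions = actionsByPath.get(path);
        if (actions == null) {
            return 0;
        }
        for (Consumer<String> action : actions) {
            action.accept(value);
        }
        return actions.size();
    }
}
```

The cost per node no longer depends on the total number of matchers, only on the number of actions registered for that exact path. The price is the time and memory needed to build and hold the map.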

License

Yet Another UserAgent Analyzer
Copyright (C) 2013-2017 Niels Basjes

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
