Uncovered significant behavioral patterns by implementing the A-Priori algorithm to extract over 750 association rules from the 2018 NYC Central Park squirrel census.

Behavioral-Pattern-Mining-in-NYC-Squirrel-Data

This project was a team effort for the COMS6111 - Advanced Database Systems course at Columbia University.

Team members:

Ronitt Mehra (UNI: rm4084)
Yueran Ma (UNI: ym2876)

Steps to run the program

git clone https://github.com/Ronitt272/Behavioral-Pattern-Mining-in-NYC-Squirrel-Data.git
cd Behavioral-Pattern-Mining-in-NYC-Squirrel-Data
pip install -r requirements.txt
python3 main.py squirrel.csv <support_threshold> <confidence_threshold>

<support_threshold> => The minimum fraction of rows in the 2018 Central Park Squirrel Census data that must contain an itemset for it to be considered frequent. For example, a support threshold of 0.7 (70%) means an itemset must appear in at least 70% of the rows to be included in the analysis.

<confidence_threshold> => The minimum confidence an association rule must reach to be reported, where confidence is the fraction of rows containing the antecedent that also contain the consequent. For example, a confidence threshold of 0.8 (80%) means the consequent must appear in at least 80% of the rows where the antecedent is present.

Both <support_threshold> and <confidence_threshold> must be between 0 and 1.0, given as fractions rather than percentages (for example, 0.7 rather than 70).
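
To make the two thresholds concrete, here is a minimal sketch of how support and confidence can be computed over rows represented as sets of items. The function names and the four sample rows are made up for illustration; they are not taken from main.py or from the census data.

```python
def support(itemset, rows):
    """Fraction of rows that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for row in rows if itemset <= row) / len(rows)


def confidence(antecedent, consequent, rows):
    """support(antecedent and consequent together) / support(antecedent)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, rows) / support(antecedent, rows)


# Hypothetical rows, not real census records.
rows = [
    frozenset({"Adult", "not running", "foraging"}),
    frozenset({"Adult", "not running"}),
    frozenset({"Adult", "running"}),
    frozenset({"Juvenile", "not running"}),
]

adult_support = support({"Adult"}, rows)                     # 3 of 4 rows
rule_confidence = confidence({"Adult"}, {"not running"}, rows)  # 2 of 3 Adult rows
print(adult_support, rule_confidence)
```

With a support threshold of 0.7, the itemset {"Adult"} in this toy data would count as frequent, but the rule Adult => not running would not pass a confidence threshold of 0.8.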

Dataset

We have used the 2018 Central Park Squirrel Census Data on NYC Open Data.
Dataset Link: https://data.cityofnewyork.us/Environment/2018-Central-Park-Squirrel-Census-Squirrel-Data/vfnx-vebw/about_data

Preprocessing

To generate squirrel.csv, we first drop the following columns, which contain information that is unnecessary for generating association rules: ['X', 'Y', 'Unique Squirrel ID', 'Hectare', 'Date', 'Hectare Squirrel Number', 'Primary Fur Color', 'Highlight Fur Color', 'Color notes', 'Above Ground Sighter Measurement', 'Other Activities', 'Other Interactions', 'Lat/Long']

Then we convert all boolean values in the dataset to strings that are easier to read in an association rule. For example, in the column 'Running', True is converted to "running" and False is converted to "not running". We applied this conversion to the following columns: ['Running', 'Chasing', 'Climbing', 'Eating', 'Foraging', 'Kuks', 'Quaas', 'Moans', 'Tail flags', 'Tail twitches', 'Approaches', 'Indifferent', 'Runs from']
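
The two preprocessing steps can be sketched roughly as follows, using plain dictionaries for rows. This is an illustration, not the script that produced squirrel.csv: the names DROPPED, LABELS, and preprocess_row are hypothetical, both lists are abbreviated, and only the 'Running' labels are confirmed by the description above (the 'Chasing' labels are inferred).

```python
# Abbreviated; the full list of dropped columns is given above.
DROPPED = {"X", "Y", "Unique Squirrel ID", "Hectare", "Date"}

# Column -> (label when True, label when False).
# Only the 'Running' pair is stated in the text; 'Chasing' is an assumption.
LABELS = {
    "Running": ("running", "not running"),
    "Chasing": ("chasing", "not chasing"),
}


def preprocess_row(row):
    """Drop unneeded columns and replace boolean values with readable labels."""
    cleaned = {}
    for column, value in row.items():
        if column in DROPPED:
            continue
        if column in LABELS:
            positive, negative = LABELS[column]
            # Accept both real booleans and the strings "true"/"false".
            is_true = str(value).strip().lower() == "true"
            value = positive if is_true else negative
        cleaned[column] = value
    return cleaned


# Hypothetical input row, not a real census record.
sample = {"X": "-73.9", "Age": "Adult", "Running": "false", "Chasing": True}
print(preprocess_row(sample))
```

Turning False into an explicit label such as "not running" means the absence of a behavior becomes an item in its own right, which is why rules like Adult => not moaning can appear in the output.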

Our dataset is compelling because it lets us see how different characteristics of a squirrel relate to its behavior when it encounters a human. This could be valuable information for the study of squirrels in biology, psychology, and animal science.

Design Description

In our program, we implement the A-Priori algorithm as described in Section 2.1 of the Agrawal and Srikant paper from VLDB 1994. However, we have not implemented the subset function using a hash tree as described in Section 2.1.2.

In the apriori function, we start with the 1-itemsets and store all frequent itemsets in freq_itemsets. In each iteration of the loop, we call apriori_gen to generate the candidate (k+1)-itemsets from the frequent k-itemsets; apriori_gen implements both the join step and the prune step. After obtaining the candidates, we scan each row of the dataset and count the rows that contain each candidate. The loop continues until no frequent (large) itemsets are found for the current k. We then iterate through all frequent itemsets and generate association rules by taking each (k-1)-item subset of a k-itemset as the left-hand side and the remaining item as the right-hand side. The confidence of a rule is computed from the supports saved while generating the itemsets. Finally, we sort the frequent itemsets by support and the rules by confidence.
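
The flow described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in main.py: the names apriori, apriori_gen, and freq_itemsets come from the description, while generate_rules, the function signatures, and the sample rows are hypothetical. As in the description, candidates are counted with a plain scan of the rows rather than a hash tree.

```python
from itertools import combinations


def apriori_gen(prev_frequent, k):
    """Build candidate k-itemsets from frequent (k-1)-itemsets (join + prune)."""
    prev = sorted(tuple(sorted(s)) for s in prev_frequent)
    candidates = set()
    for i, a in enumerate(prev):
        for b in prev[i + 1:]:
            # Join step: combine itemsets that share their first k-2 items.
            if a[:k - 2] == b[:k - 2]:
                candidate = frozenset(a) | frozenset(b)
                # Prune step: every (k-1)-subset must itself be frequent.
                if all(frozenset(sub) in prev_frequent
                       for sub in combinations(candidate, k - 1)):
                    candidates.add(candidate)
    return candidates


def apriori(rows, min_support):
    """Return {itemset: support} for every frequent itemset."""
    n = len(rows)
    candidates = {frozenset([item]) for row in rows for item in row}
    freq_itemsets = {}
    k = 1
    while candidates:
        # Scan the rows and count how many contain each candidate.
        counts = {c: sum(1 for row in rows if c <= row) for c in candidates}
        current = {c: cnt / n for c, cnt in counts.items()
                   if cnt / n >= min_support}
        freq_itemsets.update(current)
        k += 1
        candidates = apriori_gen(set(current), k)
    return freq_itemsets


def generate_rules(freq_itemsets, min_confidence):
    """Rules with a (k-1)-item left side and a single-item right side."""
    rules = []
    for itemset, sup in freq_itemsets.items():
        if len(itemset) < 2:
            continue
        for left in combinations(itemset, len(itemset) - 1):
            left = frozenset(left)
            # Every subset of a frequent itemset is frequent, so its
            # support was already saved in freq_itemsets.
            conf = sup / freq_itemsets[left]
            if conf >= min_confidence:
                rules.append((left, itemset - left, conf, sup))
    return sorted(rules, key=lambda rule: rule[2], reverse=True)


# Hypothetical rows, not real census records.
rows = [
    frozenset({"Adult", "not moaning", "not running"}),
    frozenset({"Adult", "not moaning", "running"}),
    frozenset({"Adult", "not moaning", "not running"}),
    frozenset({"Juvenile", "not moaning", "not running"}),
]
freq_itemsets = apriori(rows, min_support=0.5)
rules = generate_rules(freq_itemsets, min_confidence=0.8)
for left, right, conf, sup in rules:
    print(sorted(left), "=>", sorted(right), conf, sup)
```

Scanning every row for every candidate is simpler than the hash-tree subset function from the paper, at the cost of more comparisons per pass; for a dataset with a few thousand rows and a small number of columns this trade-off is usually acceptable.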

Identified Patterns

The example-run.txt file, generated with the support and confidence thresholds set to 70% and 80% respectively, contains the frequent itemsets and the high-confidence association rules.

To produce example-run.txt, run:

python3 main.py squirrel.csv 0.7 0.8

From the association rules, it is interesting to see that the 'Adult' attribute is often associated with 'not approaching human', 'not moaning', 'not flagging tail', 'not twitching tail', 'not quaaing', etc. This suggests that adult squirrels are often calmer than younger ones.

We also see that 'not moaning' is associated with many other items such as 'not kukking', 'not climbing', 'not quaaing', 'not running from human', 'not chasing', etc. This suggests that squirrels moan only in specific situations and that moaning is often combined with other actions.
