README modified to add dependencies and contributions section, modified and deleted some sections, gitignore updated to ignore all kinds of output files
Khusbu Mishra committed Oct 28, 2017
1 parent 83c7754 commit 41545b8
Showing 2 changed files with 23 additions and 42 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,7 +1,7 @@
 # pyc files
 *.pyc
 # data file
-dblp_data*.jl
+*.jl
 # pycharm files
 idea/*.xml
 idea/*.iml
63 changes: 22 additions & 41 deletions README.md
@@ -1,8 +1,7 @@
# dblp-spider
-Collection of co-authorship network from http://dblp.uni-trier.de/. The site stores
-information about the author who have worked together on a research topic. You can
-visit the website and check it out for more information about the kind of data
-that is present there.

+A spider written using [scrapy](https://scrapy.org/) that crawls the [dblp](http://dblp.uni-trier.de/) website to extract data about authors, such as their coauthor names, the communities
+to which they belong, and the articles they have published. The data is then used to build a co-authorship network graph.
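
For orientation, here is a minimal sketch of what such a spider can look like. The spider name `dblpspider` matches the crawl command used below; the start URL and selector are placeholders, not the repository's actual code (see coauthornetwork/spiders/ for that).
```
# Illustrative skeleton only; the real spider lives in coauthornetwork/spiders/.
import scrapy


class DblpSpider(scrapy.Spider):
    name = "dblpspider"  # matches `scrapy crawl dblpspider` below
    # Placeholder start URL; the real spider's entry pages may differ.
    start_urls = ["http://dblp.uni-trier.de/"]

    def parse(self, response):
        # Placeholder selector; dblp's actual markup differs.
        for author_name in response.css("a::text").extract():
            yield {"author_name": author_name}
```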

### Example of a co-authorship network
```
@@ -11,59 +10,40 @@
...
```
The above example denotes an edge-list (author_one->author_two). Such an edge-list
-denotes a graph of co-authors who have worked on a paper together. The file being
-generated when you run the command is in json format.
+denotes a graph of co-authors who have worked on a paper together.
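
As a sketch of how such an edge-list can be turned into a graph object: assuming each scraped record carries an author plus a coauthor list (the field names `author_name` and `coauthors` here are illustrative; check the items your spider actually emits), something like the following works with networkx, which is a third-party library and not a dependency of this project.
```
# Sketch: build a co-authorship graph from the crawler's JSON-lines output.
# Field names are assumptions; adjust them to the items your spider emits.
import json

import networkx as nx  # third-party library, not listed as a dependency here

graph = nx.Graph()
with open("dblp_data.jl") as records:
    for line in records:
        record = json.loads(line)
        for coauthor in record.get("coauthors", []):
            graph.add_edge(record["author_name"], coauthor)  # one edge per pair

print("%d nodes, %d edges" % (graph.number_of_nodes(), graph.number_of_edges()))
```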

+## Dependencies
+
-## What does the spider do?
-The code here, crawls the website of dblp, collects coauthors name list, their communities
-to which they belong, articles that they have published together. You can modify it and extract
-even more information from the web page.
+- [scrapy](https://doc.scrapy.org/en/latest/intro/install.html)
+- [python 2.7](https://www.python.org/)
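
If pip is available on your system, scrapy can usually be installed with the command below; see the linked install guide for platform-specific details.
```
$ pip install scrapy
```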

## How to clone the repository
Simply run the following command:
```
->>> git clone https://github.com/SiddharthaAnand/dblp-spider.git
+$ git clone https://github.com/SiddharthaAnand/dblp-spider.git
```
This will clone this repository to your local system.

## How to run the code
-Make sure you are in the working directory of dblp-spider. The working directory looks like this:
-```
->>> tree
-|-- coauthornetwork
-|   |-- __init__.py
-|   |-- items.py
-|   |-- middlewares.py
-|   |-- pipelines.py
-|   |-- settings.py
-|   `-- spiders
-|       |-- example.py
-|       `-- __init__.py
-|-- dblp_data1.jl
-|-- dblp_data.jl
-|-- LICENSE.md
-|-- README.md
-`-- scrapy.cfg
+Make sure you are in the working directory of dblp-spider. Then run the following command:
```
-Run the following command:
-```
->>> scrapy crawl dblpspider [-o] [filename]
+$ scrapy crawl dblpspider [-o] [filename]
```

This will start the spider, send requests asynchronously, receive the data, and store the
-output (denoted by '-o' in the filename given by you.
+output (denoted by '-o') in the filename you provide.

For example:
```
->>> scrapy crawl dblpspider -o dblp_data.jl
+$ scrapy crawl dblpspider -o dblp_data.jl
```

### Sample json data
This is the sample data that you might get after the crawl is over.
-You can optionally use the in-built json package to pretty print the
-contents of the json file. You can google it to know how to use it.
+You can optionally use the [built-in json package](https://docs.python.org/2/library/json.html) to pretty print the
+contents of the json file.
```
->>> head dblp_json.jl
+$ head dblp_data.jl
{
  "author_articles_published": [
    "Spanning tree-based fast community detection methods in social networks."
@@ -86,14 +66,15 @@ contents of the json file. You can google it to know how to use it.
}
...
```
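
For instance, since the output is one JSON object per line (JSON Lines), a minimal pretty-printing sketch with the json package could look like this; the filename is the one from the crawl example above.
```
# Sketch: pretty-print each JSON-lines record from the crawl output.
import json

with open("dblp_data.jl") as records:
    for line in records:
        print(json.dumps(json.loads(line), indent=2, sort_keys=True))
```
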
## Authors
* [Khusbu Mishra](https://github.com/Khusbu)
* [Siddhartha Anand](https://github.com/SiddharthaAnand)

## Licence
This project is licensed under the Apache Licence; see [LICENSE.md](/LICENSE.md) for more details.

## Future enhancements
* Add a no-sql db to insert data
* Deploy the spider on a server for large scale crawl
* Extract more data from dblp
* Visualize the data using a visualization tool

+## Contributions
+Contributions and suggestions are always welcome. You can modify the spider to extract even more data from [dblp](http://dblp.uni-trier.de/).
