Linkedface crawler

Introduction

  • Linkedface crawler is a production, ready-to-use web and image crawler based on Apache Nutch 2.x. We add some extra processing on top of Nutch to make it easy to use.
  • You only have to launch the AMI (Amazon Machine Image) from the AWS Marketplace at: https://aws.amazon.com/marketplace/pp/B0727KTRWQ, and you have a complete crawler solution in your hands. After initiating the crawler machine from the AMI, you can start the crawler immediately from the command line or with a crontab job and check the results.
  • We can offer intelligent processing with face indexing and searching (face detection, face feature extraction, and face search) in extension modules.

Environment

  • Apache Nutch 2.3
  • HBase: stores raw crawl data
  • Python: for data processing
  • MySQL server: stores crawled documents (webpage and image information)
  • phpMyAdmin: turned off by default to optimize server performance
  • nginx server: for viewing crawled images. The service is off by default.
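
A quick way to confirm the stack on a freshly launched instance is to print the version of each component. This is only a sanity-check sketch; it assumes the binaries are on the PATH under the Ubuntu defaults:

# Print versions of the components listed above
java -version        # JVM used by Nutch and HBase
hbase version        # HBase
mysql --version      # MySQL client for the local server
nginx -v             # nginx (service is off by default)
python --version     # Python used for data processing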

Processing steps

These are the processing steps for each round:

  1. Crawl data: web pages and images
  2. Remove small images (thumbnails, low-quality images) and save the remaining images to files
  3. Extract webpage information: publishing date, body text, header information (keywords, description, host, page-rank score)
  4. Remove duplicate images (one image can be served by multiple mirror servers)
  5. Categorize webpages and then images
  6. Export crawler data: images + MySQL data

[Figure: Linkedface crawl cycle]

Running crawler

You can run the crawler immediately by one of the following methods:

Run by command line:

  • reboot the crawler server
  • run the command: linkedface number_of_rounds (an example follows below)
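
For example, to run three crawl rounds in the background (the round count is the only argument; nohup and the log file are just conveniences for long runs, not part of the tool itself):

# Run three crawl rounds; nohup keeps the crawl running if the SSH session drops
nohup linkedface 3 > ~/linkedface_crawl.log 2>&1 &
# Follow the progress
tail -f ~/linkedface_crawl.log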

Run by crontab

  • Edit the crontab file: /etc/cron.d/linkedface
  • Un-comment the last two lines and set the running schedule:
# /etc/cron.d/linkedface: crontab entries for the linkedface crawler
SHELL=/bin/bash
PATH=/home/ubuntu/bin:/home/ubuntu/.local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
MAIL=/var/mail/ubuntu
JAVA_HOME=/usr/lib/jvm/default-java/
LANG=en_US.UTF-8
HOME=/home/ubuntu
SHLVL=2
LANGUAGE=en_US.UTF-8

#40 * * * * ubuntu /home/ubuntu/labsofthings/crawler/bin/hbase_cleaner
#40 * * * * ubuntu /home/ubuntu/labsofthings/crawler/bin/start.sh 3
  • Reboot your server, and the crawler will run on the configured schedule
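
Un-commented, the last two lines might look like the following. The schedule values are only an illustration (clean HBase at minute 40, then start a 3-round crawl at minute 45 of every hour); set whatever schedule suits your crawl size:

# Clean HBase at minute 40 of every hour
40 * * * * ubuntu /home/ubuntu/labsofthings/crawler/bin/hbase_cleaner
# Start a 3-round crawl at minute 45 of every hour
45 * * * * ubuntu /home/ubuntu/labsofthings/crawler/bin/start.sh 3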

Output data

There are two kinds of output data:

Database

Webpage documents and image information are stored in a MySQL database installed locally on the crawler server:

  • DB name: crawler
  • Tables: images and webpages

You can view the table structure from the SQL command line or with phpMyAdmin. Data is exported into an SQL dump file for each round of crawling; only the data generated in that round is dumped.
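
A minimal sketch of inspecting the database from the shell, assuming the local MySQL server and the table names above (replace the user and password with the credentials configured on your instance):

# Show the tables and their structure in the crawler database
mysql -u root -p crawler -e 'SHOW TABLES; DESCRIBE webpages; DESCRIBE images;'
# Count how many webpages and images have been collected so far
mysql -u root -p crawler -e 'SELECT COUNT(*) FROM webpages; SELECT COUNT(*) FROM images;'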

Image files

Images are stored in the following folders:

├── imgs        // For original images
└── thumbs 
    ├── 150     // For generated thumbnail images – width: 150 px
    └── 300     // For generated thumbnail images – width: 300 px

All of this data is packaged into a compressed file for each round and stored in the folder:

/labsofthings/data/crawler_data/yyyy-MM-dd/
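
For example, to see and unpack what today's rounds produced (a sketch only; the archives are assumed to be .zip files, as suggested by the $1.zip reference in index.sh below):

# List today's packaged crawl data (folders are named yyyy-MM-dd)
ls -lh /labsofthings/data/crawler_data/$(date +%F)/
# Unpack the archives for inspection
for f in /labsofthings/data/crawler_data/$(date +%F)/*.zip; do
    unzip "$f" -d ~/crawl_output
done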

Configuration options

Seed file

All starting URLs for crawling are stored in the file: /home/ubuntu/labsofthings/seed/urls.txt

You can add more starting URLs here, one URL per line.
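
For example (the URLs below are placeholders, not part of the default seed list):

# Append additional seed URLs, one per line
echo 'https://example.com/'      >> /home/ubuntu/labsofthings/seed/urls.txt
echo 'https://example.org/news/' >> /home/ubuntu/labsofthings/seed/urls.txt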

Data folder

To crawl a large amount of data, you have to put the data folder on a large storage device. You can do this with these steps (a worked sketch follows after the list):

  • Create a large AWS storage device (for example, an st1 EBS volume) and attach it to the machine
  • Create folders on this device with the structure below, owned by the ubuntu user:
├── data
│   ├── crawler_data
│   ├── hbase_data
│   └── zookeeper_data
└── nginx_img
  • Create soft links:
#: ln -s /labsofthings/data/ data
#: ln -s /labsofthings/nginx_img/ nginx_img
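
The steps above might look like the following on a fresh volume. This is a sketch under stated assumptions: the attached volume appears as /dev/xvdf and is mounted at /labsofthings; both the device name and the mount point may differ on your instance.

# Format and mount the attached volume, then create the data layout
sudo mkfs -t ext4 /dev/xvdf
sudo mkdir -p /labsofthings
sudo mount /dev/xvdf /labsofthings
sudo mkdir -p /labsofthings/data/crawler_data /labsofthings/data/hbase_data /labsofthings/data/zookeeper_data /labsofthings/nginx_img
sudo chown -R ubuntu:ubuntu /labsofthings
# Soft links from the ubuntu home directory, as above
cd /home/ubuntu
ln -s /labsofthings/data/ data
ln -s /labsofthings/nginx_img/ nginx_img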

Exporting data

You can copy the packaged data file of each crawl round to a destination server with these steps:

  • Open the file: /home/ubuntu/labsofthings/indexer/index.sh

  • Un-comment line 42 and fill in the information for the destination server:

# Copy data into remote gateway server
#scp -i $key $1.zip $gatewayUser@$gatewayIP:$gatewayFolder/$1.zip
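
Once un-commented, the variables used by that line ($key, $gatewayUser, $gatewayIP, $gatewayFolder) must point at your own gateway server. The variable names come from the scp line itself; the values below are placeholders:

# Destination server configuration (placeholder values)
key=/home/ubuntu/.ssh/gateway_key.pem
gatewayUser=ubuntu
gatewayIP=203.0.113.10
gatewayFolder=/data/crawler_incoming

# The un-commented copy line inside index.sh
scp -i $key $1.zip $gatewayUser@$gatewayIP:$gatewayFolder/$1.zip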

Others

  • Remove duplicates
  • Categorize documents

View crawl result

Checking the result data is easy. You need to turn on phpMyAdmin and the nginx server to do it.

Note:

  • These operations are security risks. They are intended for testing only; turn the services off again after testing.
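
Assuming the standard Ubuntu service names (nginx runs as the nginx service; phpMyAdmin is served by Apache, as noted below), the two viewers can be toggled like this:

# Turn the viewers on for inspection ...
sudo service nginx start
sudo service apache2 start
# ... and off again when you are done
sudo service nginx stop
sudo service apache2 stop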

Nginx

Sample nginx screen:

[Screenshot: nginx serving crawled images]

phpMyAdmin

Start or stop phpMyAdmin with the command: sudo service apache2 start / stop

Sample phpMyAdmin screen:

[Screenshot: phpMyAdmin showing the crawler database]

Extension support

  • We can offer face detection and face feature extraction in an extension module to help you build a face search service.
  • Please contact us for more information.

Notable Relevant Projects

Contact us

You can contact me for more information:

  • Name: Thuc X.Vu
  • Organization: IoT and Data processing Labs
