This is a simple project that crawls the page "http://wiprodigital.com" and all discovered subpages within the same domain. The result of the crawl is a file, by default named following the pattern "results_dd_MM_yyyy_HH_mm_ss.xml" and located in the same folder as the jar file. Optionally, you can specify another file location as the target by passing a single argument pointing to the file when running the jar.
This output file contains information about the root page and all of its subpages, which can easily be used to create a map of the link connections between the pages.
Structure of the output file:
Information about the crawled root page is stored under the XPath:
- "k.cichocki.wipro.result.Result/baseUrl"
Knowing the root page, you can search for a matching url string to map from this page to its subpages. Under each XPath:
- "k.cichocki.wipro.result.Result/results/entry/" you will find the following elements (illustrated in the snippet after this list):
** "string" - the url of the crawled page.
** "k.cichocki.wipro.result.CrawlingResult" - information about the results of crawling the above page:
*** "info" - one of CRAWLED, EXTERNAL, MALFORMED_URL, CONNECTION_ERROR.
** "Links" - containing the elements:
*** "k.cichocki.wipro.linkextractor.Link" - each of which contains:
**** "url" - the crawled url
**** "baseUrl" - the url of the page that was the origin of this url, i.e. the page from which it was crawled
**** "resource" - [true|false], determining whether this link is a resource
Requirements:
- jdk >= 1.9
- maven (I have used 3.6.3, not tested with other versions; the environment variable "JAVA_HOME" should be set to point to the JDK root folder)
HOW TO BUILD:
cd to '${project.dir}/' and issue the command:
mvn clean package
This creates the jars:
- ${project.dir}/target/WiproHtmlCrawler.jar
- ${project.dir}/target/WiproHtmlCrawler-jar-with-dependencies.jar
Test results are stored in:
- ${project.dir}/target/surefire-reports/
How to run the jar:
cd to '${project.dir}/' and run:
- java -jar target/WiproHtmlCrawler-jar-with-dependencies.jar [OPTIONAL: destination file name]
If run without arguments, it creates a file in the current dir following the pattern "results_dd_MM_yyyy_HH_mm_ss.xml" and containing the crawling results.
Running with a single argument will treat it as the path to the output file; for instance, invoking:
- java -jar target/WiproHtmlCrawler-jar-with-dependencies.jar out.xml
will store the crawling results in the file:
- ${project.dir}/out.xml
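Once a result file exists, it can be consumed with any XML tooling. Below is a minimal sketch, using only the JDK's built-in DOM and XPath APIs, that reads a result file and prints each crawled page with its discovered links. It assumes the element names and layout described above (not verified against actual crawler output) and a hypothetical input file name "out.xml":

    import java.io.File;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPath;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    // Sketch: print each crawled page and its discovered links.
    // Element names are taken from the structure described in this README.
    public class ResultReader {
        public static void main(String[] args) throws Exception {
            File input = new File(args.length > 0 ? args[0] : "out.xml");
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().parse(input);
            XPath xp = XPathFactory.newInstance().newXPath();

            // Each entry pairs a page url ("string") with its crawling result.
            NodeList entries = (NodeList) xp.evaluate(
                    "/k.cichocki.wipro.result.Result/results/entry",
                    doc, XPathConstants.NODESET);
            for (int i = 0; i < entries.getLength(); i++) {
                Element entry = (Element) entries.item(i);
                String page = xp.evaluate("string", entry);
                // The descendant axis tolerates either nesting of the link list.
                NodeList urls = (NodeList) xp.evaluate(
                        ".//k.cichocki.wipro.linkextractor.Link/url",
                        entry, XPathConstants.NODESET);
                System.out.println(page);
                for (int j = 0; j < urls.getLength(); j++) {
                    System.out.println("  -> " + urls.item(j).getTextContent());
                }
            }
        }
    }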
TO BE DONE
- Allow for specifying the page to be crawled
- Allow for specifying the number of threads to be used
- Add other output formats for the results
- Create a benchmark for crawling with different concurrency levels
- Create a crawler that executes the scripts on the page and detects outgoing requests, adding them to the discovered links for checking