
Towards Finding Scholarly Articles in Internet Using Hadoop MapReduce with Oozie Workflow

File Size:
209.31 kB
Author:
Jakub Jurkiewicz, Aleksander Nowiński
Email:
{J[dot]Jurkiewicz,A[dot]Nowinski}[at]icm[dot]edu[dot]pl
Date:
11 March 2014

Abstract: This article addresses new methods for the automatic processing and analysis of scientific papers. It covers the first stage of that task: the discovery and harvesting of scientific publications from the internet. The focus is on discovering and analysing HTML documents to identify publication resources. Using data from the Common Crawl project makes it possible to operate on a large subset of web pages without performing an expensive crawl of the WWW. We present methods for automatically identifying pages that describe scholarly documents on the WWW using HTML meta headers. The presented set of rules, applied to this data, achieves reasonable quality. A system based on these tools is also presented. It makes it easy to operate the process and to transfer its output to the COntent ANalysis SYStem (CoAnSys), a processing and analysis system developed at ICM. To achieve this, a set of MapReduce tasks running with Hadoop and Oozie is used. The quality and efficiency of the described rules are discussed. Finally, future challenges for our system are presented.
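
As a rough illustration of what identifying scholarly pages from HTML meta headers might look like, here is a minimal Python sketch. It is not the authors' rule set, which the abstract does not list. The prefixes (`citation_`, `dc.`, `dcterms.`, `prism.`) are assumptions based on commonly used bibliographic meta-tag conventions, and the names `MetaNameCollector`, `looks_scholarly`, and `min_hits` are hypothetical.

```python
from html.parser import HTMLParser

# Assumed meta-name prefixes; the abstract does not state the actual rules.
SCHOLARLY_PREFIXES = ("citation_", "dc.", "dcterms.", "prism.")


class MetaNameCollector(HTMLParser):
    """Collects the lowercased `name` attribute of every <meta> tag."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        name = dict(attrs).get("name")
        if name:  # skip <meta> tags without a usable name attribute
            self.names.append(name.lower())


def looks_scholarly(html, min_hits=2):
    """Hypothetical rule: flag a page with at least `min_hits`
    meta names that start with one of the assumed prefixes."""
    collector = MetaNameCollector()
    collector.feed(html)
    hits = [n for n in collector.names if n.startswith(SCHOLARLY_PREFIXES)]
    return len(hits) >= min_hits
```

In a MapReduce setting, a check of this kind would plausibly run inside the map step over each crawled page, emitting only the URLs of pages that pass. The threshold of two matching tags is an arbitrary choice made for this sketch.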

Keywords: Hadoop, web mining, scientific content finding, web page classification

Area: Electronics, Telecommunications and Informatics

BibTeX:

@article{Jurkiewicz2013,
  author = {Jakub Jurkiewicz and Aleksander Nowiński},
  title = {Towards Finding Scholarly Articles in Internet Using Hadoop MapReduce with Oozie Workflow},
  journal = {Challenges of Modern Technology},
  year = {2013},
  volume = {4},
  number = {4},
  pages = {3--6},
  url = {http://www.journal.young-scientists.eu/index.php/isuues/file/161-towards-finding-scholarly-articles-in-internet-using-hadoop-mapreduce-with-oozie-workflow}
}

 

 
 
