Paper Title
Unsupervised Query Result Extraction From Single Web Page
Abstract
This paper addresses the problem of extracting data from a Web page that contains contiguous structured or
semi-structured data records, also referred to as Object data (ODATA). The first objective is to identify the region
containing the contiguous ODATA, also referred to as the Data region or Object Region (OREG). The individual data items
(fields) of each ODATA are then extracted and written to an XML file for further processing. This problem has been studied
by several researchers, but existing methods still have serious limitations: they are inaccurate, time-consuming, or rely on
many assumptions. This paper proposes a novel technique to automate the retrieval of individual ODATA from a Web page.
It consists of three steps: 1) predict the target OREG, 2) validate the OREG, and 3) identify and extract the attributes of
individual ODATA and write them to the XML file. This approach enables very accurate alignment and extraction of multiple
ODATA. Experimental results on a large number of Web pages from diverse domains show that the proposed technique
segments ODATA (data records) and aligns and extracts their data very accurately.
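
The three steps named above can be illustrated with a minimal Python sketch. The abstract does not describe the paper's actual algorithm, so everything here is an assumption made for illustration: the repeated-sibling heuristic, the function names (`signature`, `predict_region`, `validate_region`, `extract_records`), the `min_records` threshold, and the sample page `PAGE` are all hypothetical, and the input is assumed to be well-formed XHTML so that the standard-library XML parser can read it.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample page: one navigation block and one list of three records.
PAGE = """<html><body>
<div id="nav"><a>Home</a></div>
<ul>
<li><b>Widget</b><span>$5</span></li>
<li><b>Gadget</b><span>$7</span></li>
<li><b>Gizmo</b><span>$9</span></li>
</ul>
</body></html>"""


def signature(node):
    """Structural signature: tags of the node and its descendants, in document order."""
    return tuple(n.tag for n in node.iter())


def predict_region(root):
    """Step 1 (assumed heuristic): choose the element with the most same-signature children."""
    best, best_count = None, 0
    for parent in root.iter():
        counts = Counter(signature(child) for child in parent)
        if counts:
            count = counts.most_common(1)[0][1]
            if count > best_count:
                best, best_count = parent, count
    return best


def validate_region(region, min_records=2):
    """Step 2 (assumed check): require at least min_records children with the dominant signature."""
    if region is None or len(region) == 0:
        return []
    counts = Counter(signature(child) for child in region)
    dominant, count = counts.most_common(1)[0]
    if count < min_records:
        return []
    return [child for child in region if signature(child) == dominant]


def extract_records(records):
    """Step 3: copy each non-empty text node of a record into a <field> of an XML tree."""
    out = ET.Element("records")
    for rec in records:
        rec_el = ET.SubElement(out, "record")
        for node in rec.iter():
            text = (node.text or "").strip()
            if text:
                ET.SubElement(rec_el, "field", name=node.tag).text = text
    return out


def run(page=PAGE):
    """Run the three steps on a page and return the resulting XML element."""
    root = ET.fromstring(page)
    region = predict_region(root)
    return extract_records(validate_region(region))
```

Calling `ET.tostring(run(), encoding="unicode")` serializes the extracted records. Real Web pages are rarely well-formed XML, so a practical version would need a tolerant HTML parser and a more robust similarity measure than exact signature equality.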
Index Terms - Data Record Extraction, Information Extraction, Web Content Mining, Semi-Structured Data.
Authors - Aleem Ansari, Hemlata Vasishtha
Published: Volume-3, Issue-10 (Oct, 2016)
DOIONLINE Number - IJAECS-IRAJ-DOIONLINE-5981
Published on 2016-11-12