Saturday, July 14, 2012

Differences Between SDE and DEPTA

Although SDE is based on DEPTA, it differs from DEPTA in the following areas:

  1. Tag Tree Building
     DEPTA builds the tag tree from the sequence of opening tags and their rendered rectangles, whereas SDE uses an HTML DOM parser (Neko HTML Parser). SDE first builds the HTML DOM tree with Neko HTML Parser, then builds the tag tree from that DOM tree.

  2. Visual Information
     DEPTA exploits visual information in tree matching and in checking the gaps between data records. SDE doesn't use visual information. Therefore, unlike DEPTA, SDE is unable to extract non-contiguous data records.

  3. Similarity Score Calculation Between Data Record Candidates
     When calculating the similarity score between data record candidates, DEPTA only considers tags that contain text. SDE, on the other hand, considers all tags in the data record candidates.

SDE can accept the following parameters in order to extract structured data:

  1. similarity threshold: The minimum similarity score between tag trees/subtrees to determine whether those trees/subtrees are generalized nodes/data records or not.
  2. maximum number of nodes in generalized nodes
  3. whether to use content similarity in partial tree alignment or not
  4. whether to ignore formatting tags like B, I, U or not in tag tree building

Saturday, June 9, 2012

Structured Data Extractor API Usage

The main process of SDE is written in AppConsole.java. SDE is based on the DEPTA method invented by Yanhong Zhai and Bing Liu. DEPTA extracts structured data from a web page in the following steps:

  1. Build Tag Tree
  2. Mining Data Regions
  3. Mining Data Records
  4. Align Data Items


The general architecture of the DEPTA system (Zhai and Liu, 2006)

1. Build Tag Tree

First, we build a tag tree for the web page. SDE uses NekoHTML Parser to create the DOM tree of the web page, then creates the tag tree from that DOM tree. The input parameter is a string in URI format that refers to the input web page. The ignoreFormattingTags parameter is a boolean. If its value is true, HTML formatting tags such as B, I, and U are ignored during tag tree building (they are treated as ordinary text).

TagTreeBuilder builder = new DOMParserTagTreeBuilder();
TagTree tagTree = builder.buildTagTree(input, ignoreFormattingTags);

2. Mining Data Regions

After the tag tree has been built, SDE creates a TreeMatcher object that is used to calculate the similarity score between subtrees in the tag tree. SDE uses an implementation of the Simple Tree Matching algorithm. The TreeMatcher object is passed as a parameter to the MiningDataRegions constructor. MiningDataRegions is an implementation of the Mining Data Regions algorithm, which identifies the data regions that exist in the web page. To find those data regions, SDE calls the findDataRegions method of the MiningDataRegions object. The method returns a list of DataRegion objects. We can customize the parameter values used by the Mining Data Regions algorithm, such as the maximum number of nodes in a generalized node (default 9) and the similarity threshold (default 90%).

TreeMatcher matcher = new SimpleTreeMatching();
DataRegionsFinder dataRegionsFinder = new MiningDataRegions( matcher );
List<DataRegion> dataRegions = dataRegionsFinder.findDataRegions(tagTree.getRoot(), maxNodeInGeneralizedNodes, similarityTreshold);
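
To give an idea of what a tree matcher computes, here is a minimal, self-contained sketch of the Simple Tree Matching recurrence. It is not SDE's SimpleTreeMatching class: SketchNode and StmSketch are hypothetical names used only for illustration, and the normalization shown (matched pairs divided by the average node count of the two trees) is one commonly described way to turn the raw score into a value between 0 and 1, which may differ from what SDE does internally.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal tag tree node used only in this sketch (hypothetical, not SDE's node class).
final class SketchNode {
    final String tag;
    final List<SketchNode> children = new ArrayList<SketchNode>();

    SketchNode(String tag, SketchNode... kids) {
        this.tag = tag;
        children.addAll(Arrays.asList(kids));
    }

    // Number of nodes in the subtree rooted at this node.
    int size() {
        int n = 1;
        for (SketchNode child : children) {
            n += child.size();
        }
        return n;
    }
}

final class StmSketch {
    // Returns the number of node pairs in a maximum order-preserving matching.
    static int match(SketchNode a, SketchNode b) {
        if (!a.tag.equals(b.tag)) {
            return 0; // different root tags: the subtrees do not match at all
        }
        int m = a.children.size();
        int n = b.children.size();
        int[][] table = new int[m + 1][n + 1]; // row 0 and column 0 stay 0
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                int w = match(a.children.get(i - 1), b.children.get(j - 1));
                table[i][j] = Math.max(Math.max(table[i][j - 1], table[i - 1][j]),
                                       table[i - 1][j - 1] + w);
            }
        }
        return table[m][n] + 1; // +1 counts the matched roots
    }

    // Raw score divided by the average size of the two trees.
    static double normalized(SketchNode a, SketchNode b) {
        return match(a, b) / ((a.size() + b.size()) / 2.0);
    }
}
```

A normalized score like this is the kind of value that would be compared against the similarity threshold.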

3. Mining Data Records

For each data region, SDE extracts its data records (rows in a table structure) by calling the findDataRecords method of a MiningDataRecords object. The method returns an array of DataRecord objects.

DataRecordsFinder dataRecordsFinder = new MiningDataRecords( matcher );
DataRecord[][] dataRecords = new DataRecord[ dataRegions.size() ][];

for( int dataRecordArrayCounter = 0; dataRecordArrayCounter < dataRegions.size(); dataRecordArrayCounter++)
{
   DataRegion dataRegion = dataRegions.get( dataRecordArrayCounter );
   dataRecords[ dataRecordArrayCounter ] = dataRecordsFinder.findDataRecords(dataRegion, similarityTreshold);
}

4. Align Data Items

For each array of DataRecord objects, SDE aligns the data items to transform them into a table structure (rows and columns). SDE does that by calling the alignDataRecords method of a ColumnAligner object. The method returns a two-dimensional array of strings that contains the extracted data items. SDE uses an implementation of the Partial Tree Alignment algorithm for the column alignment. We can choose whether or not to use the similarity of the data items in the column alignment process. If we choose to use it, SDE uses an implementation of the Enhanced Simple Tree Matching algorithm instead of Simple Tree Matching to calculate similarity scores in the partial tree alignment process. SDE's implementation of Enhanced Simple Tree Matching is not a full implementation because it doesn't use visual information as described by Zhai and Liu (2006).

ColumnAligner aligner = null;

if ( useContentSimilarity )
{
   aligner = new PartialTreeAligner( new EnhancedSimpleTreeMatching() );
}
else
{
   aligner = new PartialTreeAligner( matcher );
}

List<String[][]> dataTables = new ArrayList<String[][]>();

for(int tableCounter=0; tableCounter< dataRecords.length; tableCounter++)
{
   String[][] dataTable = aligner.alignDataRecords( dataRecords[tableCounter] );

   if ( dataTable != null )
   {
      dataTables.add( dataTable );
   }
}
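
Once the data tables have been collected, they can be rendered for viewing. The following is a minimal sketch of turning one String[][] table into an HTML table. HtmlTableSketch is a hypothetical helper, not SDE's actual output code, and it assumes that a missing data item may appear as null in the array.

```java
// Hypothetical helper that renders one aligned data table as an HTML table.
final class HtmlTableSketch {
    // Escapes the characters that would otherwise be read as HTML markup.
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String toHtml(String[][] table) {
        StringBuilder html = new StringBuilder("<table border=\"1\">\n");
        for (String[] row : table) {
            html.append("  <tr>");
            for (String cell : row) {
                // Assumption: a null cell means the record has no item in this column.
                html.append("<td>").append(cell == null ? "" : escape(cell)).append("</td>");
            }
            html.append("</tr>\n");
        }
        return html.append("</table>\n").toString();
    }
}
```

Calling HtmlTableSketch.toHtml( dataTable ) for each entry in dataTables and concatenating the results would give one HTML table per data region.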

Thursday, May 31, 2012

Structured Data Extractor - An Implementation of Data Extraction based on Partial Tree Alignment (DEPTA)

Structured Data Extractor (SDE) is an implementation of DEPTA (Data Extraction based on Partial Tree Alignment), a method to extract data from web pages (HTML documents). DEPTA was invented by Yanhong Zhai and Bing Liu from the University of Illinois at Chicago and was published in their paper "Structured Data Extraction from the Web based on Partial Tree Alignment" (IEEE Transactions on Knowledge and Data Engineering, 2006). Given a web page, SDE detects the data records contained in the page and extracts them into a table structure (rows and columns). You can download the application from this link: Download Structured Data Extractor

Usage

  1. Extract sde.zip.
  2. Make sure that the Java Runtime Environment (version 5 or higher) is already installed on your computer.
  3. Open command prompt (Windows) or shell (UNIX).
  4. Go to the directory where you extracted sde.zip.
  5. Run this command: java -jar sde-runnable.jar URI_input path_to_output_file
  6. The URI_input parameter can refer to a local file or a remote file, as long as it is a valid URI. A URI referring to a local file must be preceded by "file:///". For example, in a Windows environment: "file:///D:/Development/Proyek/structured_data_extractor/bin/input/input.html", or in a UNIX environment: "file:///home/seagate/input/input.html".
  7. The path_to_output_file parameter is a valid path in the host operating system, like "D:\Data\output.html" (Windows) or "/home/seagate/output/output.html" (UNIX).
  8. The extracted data can be viewed in the output file. The output file is an HTML document, and the extracted data is presented in HTML tables.
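
If you are unsure how to write the "file:///" form for a local file, the standard Java library can build it. The sketch below is not part of SDE; InputUriSketch is a hypothetical helper, and it relies on java.nio.file.Paths, which requires Java 7 or higher.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.nio.file.Paths;

// Hypothetical helper for preparing the URI_input argument.
final class InputUriSketch {
    // Converts a local path (relative or absolute) into a "file:///..." URI string.
    static String toFileUri(String localPath) {
        return Paths.get(localPath).toAbsolutePath().normalize().toUri().toString();
    }

    // Returns true when the text parses as a URI that has a scheme such as file or http.
    static boolean isAbsoluteUri(String text) {
        try {
            return new URI(text).isAbsolute();
        } catch (URISyntaxException e) {
            return false; // for example, a Windows path containing backslashes
        }
    }
}
```

For example, InputUriSketch.toFileUri("input/input.html") resolves the path against the current directory and returns a string that starts with "file:///", which can then be passed as URI_input.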

Source Code

SDE source code is available at GitHub.

Dependencies

SDE was developed using these libraries:

  • Neko HTML Parser by Andy Clark and Marc Guillemot. Licensed under Apache License Version 2.0.
  • Xerces by The Apache Software Foundation. Licensed under Apache License Version 2.0.

License

SDE is licensed under the MIT license.

Author

Sigit Dewanto, sigitdewanto11[at]yahoo[dot]co[dot]uk, 2009.

Wednesday, May 30, 2012

What is Web Mining?

Today, the World Wide Web (or the Web for short) is a huge information source. Before the Web, finding information meant asking another person or looking it up in books or other kinds of text documents. Now, if we need information about something, we can just open a web browser and search for it with a web search engine like Google. The Web is also a popular communication medium. People interact with each other via web forums or social networking sites like Facebook and Twitter. Finally, the Web is also an important channel for conducting business. Many companies use the Web for product campaigns or to open online stores.

Because of those important uses of the Web, much research has been conducted on extracting useful information from it. According to Liu (2007), web mining aims to discover useful information or knowledge from the web hyperlink structure, page content, and usage data. Based on those primary kinds of data used in the mining process, web mining tasks can be categorized into three types: web structure mining, web content mining, and web usage mining.

Web Structure Mining

Web structure mining aims to discover useful knowledge from hyperlinks, which represent the structure of the Web. A hyperlink is a link in a web page that refers to another region of the same page or to another web page. The most popular application of web structure mining is calculating the importance of web pages. This kind of application is used in the Google search engine to order its search results. PageRank, a web structure mining algorithm, was invented by the Google founders, Larry Page and Sergey Brin. Web structure mining can also be applied to cluster or classify web pages (Gomes and Gong, 2005).
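
To make the idea of page importance concrete, here is a simplified sketch of PageRank computed by repeated iteration over a tiny link graph. It only illustrates the basic idea and is not how any real search engine is implemented. PageRankSketch is a hypothetical name, the damping factor is passed in by the caller (0.85 is a commonly quoted value), and pages without outgoing links are assumed to spread their rank evenly over all pages.

```java
import java.util.Arrays;

// Simplified PageRank by power iteration (illustrative sketch only).
final class PageRankSketch {
    // links[i] lists the indexes of the pages that page i links to.
    static double[] rank(int[][] links, double damping, int iterations) {
        int n = links.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from a uniform distribution
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double dangling = 0.0; // rank held by pages without outgoing links
            for (int i = 0; i < n; i++) {
                if (links[i].length == 0) {
                    dangling += rank[i];
                    continue;
                }
                double share = rank[i] / links[i].length;
                for (int target : links[i]) {
                    next[target] += share; // each link passes on an equal share
                }
            }
            for (int i = 0; i < n; i++) {
                next[i] = (1.0 - damping) / n + damping * (next[i] + dangling / n);
            }
            rank = next;
        }
        return rank;
    }
}
```

With three pages where pages 0 and 1 both link to page 2 and page 2 links back to page 0, page 2 ends up with the highest score and page 1, which nobody links to, with the lowest.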

Web Content Mining

Web content mining extracts or mines useful information or knowledge from web page contents. There are two categories of web content mining: structured data extraction and text mining. The idea behind structured data extraction is that many web sites display important information retrieved from their databases using fixed templates. We can identify those templates by finding repeated patterns in web pages. Apart from structured data, the Web also contains a huge amount of unstructured text written in natural language. One of the common tasks in text mining is extracting people's opinions or sentiments expressed in product reviews, forum reviews, social networks, and blogs.

Web Usage Mining

Web usage mining aims to capture and model the behavioral patterns and profiles of users who interact with a web site. Such patterns can be used to better understand the behaviors of different user segments, to improve the organization and structure of the site, and to create personalized experiences for users by providing dynamic recommendations of products and services. Unlike the two previous web mining tasks, the primary data source for web usage mining is the web server access log, not the web pages.

References

Gomes, M. and Gong, Z., 2005, Web Structure Mining: An Introduction, Proceedings of the 2005 IEEE International Conference on Information Acquisition

Liu, B., 2007, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer