The Web of Knowledge: Differences Between SDE and DEPTA

Saturday, July 14, 2012

Differences Between SDE and DEPTA

Although SDE is based on DEPTA, it differs from DEPTA in the following areas:

Tag Tree Building

DEPTA builds its tag tree from the sequence of opening tags and their rendered rectangles, whereas SDE uses an HTML DOM parser (Neko HTML Parser). Neko HTML Parser first builds the HTML DOM tree, and SDE then builds its tag tree from that DOM tree.
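To make the parser-based approach concrete, here is a minimal sketch of building a tag tree from parsed HTML. It is not SDE's code: it uses Python's standard html.parser as a stand-in for Neko HTML Parser, and the names TagNode, TagTreeBuilder, build_tag_tree, and outline are made up for this illustration.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so the builder must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagNode:
    """One node of the tag tree (hypothetical structure, not SDE's class)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = ""


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree from parser events instead of rendered rectangles."""

    def __init__(self):
        super().__init__()
        self.root = TagNode("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = TagNode(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray closing tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data.strip()


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node):
    """Return the tree as nested (tag, children) tuples for easy inspection."""
    return (node.tag, [outline(child) for child in node.children])


tree = build_tag_tree("<table><tr><td>A</td><td>B<br></td></tr></table>")
print(outline(tree))
```

Because the tree comes from the parser alone, no browser rendering is needed, which is also why this approach has no rectangle coordinates to work with later.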

Visual Information

DEPTA exploits visual information both in tree matching and in checking the gaps between data records. SDE does not use visual information. Therefore, unlike DEPTA, SDE cannot extract non-contiguous data records.

Similarity Score Calculation Between Data Record Candidates

When calculating the similarity score between data record candidates, DEPTA considers only tags that contain text. SDE, on the other hand, considers all tags in the data record candidates.
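The effect of this difference can be sketched with a small example. The code below uses normalized Simple Tree Matching as the similarity measure and compares two candidates twice: once over all tags, and once after dropping subtrees that contain no text. This is an illustration under assumptions, not the code of either system: the Node class, the function names, and the reading of "tags that contain text" as "subtrees with text somewhere inside them" are all choices made for this sketch.

```python
class Node:
    """Minimal tag tree node (hypothetical, for illustration only)."""

    def __init__(self, tag, children=(), text=""):
        self.tag = tag
        self.children = list(children)
        self.text = text


def simple_tree_matching(a, b):
    """Size of the largest order-preserving match between trees a and b."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            matched = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            table[i][j] = max(table[i - 1][j],
                              table[i][j - 1],
                              table[i - 1][j - 1] + matched)
    return table[m][n] + 1  # +1 for the matching roots


def size(node):
    return 1 + sum(size(child) for child in node.children)


def has_text(node):
    return bool(node.text.strip()) or any(has_text(c) for c in node.children)


def prune_textless(node):
    """Copy the tree, dropping child subtrees that hold no text at all."""
    kept = [prune_textless(c) for c in node.children if has_text(c)]
    return Node(node.tag, kept, node.text)


def similarity(a, b, text_tags_only=False):
    """Normalized score in [0, 1]; text_tags_only mimics the text-only variant."""
    if text_tags_only:
        a, b = prune_textless(a), prune_textless(b)
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


# Two candidates that differ only in a text-free image cell.
first = Node("tr", [Node("td", text="Book"), Node("td", [Node("img")])])
second = Node("tr", [Node("td", text="Pen"), Node("td")])

print(similarity(first, second))                       # all tags
print(similarity(first, second, text_tags_only=True))  # only tags with text
```

In this example the two candidates differ only in a cell without text, so the all-tags score is lower than the text-only score. With a fixed similarity threshold, that difference can decide whether two candidates are grouped together.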

SDE accepts the following parameters when extracting structured data:

similarity threshold: the minimum similarity score that tag trees/subtrees must reach to be treated as generalized nodes/data records
maximum number of nodes in a generalized node
whether to use content similarity in partial tree alignment
whether to ignore formatting tags such as B, I, and U when building the tag tree
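As a rough picture of how these four parameters fit together, they could be grouped into a single options object. The sketch below is hypothetical: the class name, the field names, and especially the default values are placeholders chosen for illustration and are not SDE's actual option names or defaults. It also assumes the similarity score is normalized to the range 0 to 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionOptions:
    """Hypothetical container for the four parameters listed above."""

    similarity_threshold: float = 0.8        # placeholder value, not SDE's default
    max_nodes_in_generalized_node: int = 10  # placeholder value, not SDE's default
    use_content_similarity: bool = False     # applies to partial tree alignment
    ignore_formatting_tags: bool = True      # skip B, I, U, ... in tag tree building

    def __post_init__(self):
        # Assumes scores are normalized to [0, 1].
        if not 0.0 <= self.similarity_threshold <= 1.0:
            raise ValueError("similarity_threshold must be between 0 and 1")
        if self.max_nodes_in_generalized_node < 1:
            raise ValueError("max_nodes_in_generalized_node must be at least 1")

    def is_similar(self, score):
        """True if a similarity score reaches the configured minimum."""
        return score >= self.similarity_threshold


options = ExtractionOptions(similarity_threshold=0.9)
print(options.is_similar(0.95))
print(options.is_similar(0.5))
```

A higher threshold makes the extractor stricter about what counts as a repeated structure, while a lower one tolerates more variation between candidates.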
