Saturday, July 14, 2012

Differences Between SDE and DEPTA

Although SDE is based on DEPTA, it differs from DEPTA in the following areas:

  1. Tag Tree Building
     DEPTA builds the tag tree from the sequence of opening tags and their rendered rectangles, whereas SDE uses an HTML DOM parser (Neko HTML Parser). SDE first builds an HTML DOM tree with Neko HTML Parser, then builds the tag tree from that DOM tree.

  2. Visual Information
     DEPTA exploits visual information in tree matching and in checking the gaps between data records. SDE doesn't use visual information. Therefore, unlike DEPTA, SDE is unable to extract non-contiguous data records.

  3. Similarity Score Calculation Between Data Record Candidates
     When calculating the similarity score between data record candidates, DEPTA considers only tags that contain text. SDE, on the other hand, considers all tags in the data record candidates.
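To make the third difference concrete, here is a minimal Python sketch. It is not SDE's or DEPTA's actual code. It assumes the similarity score is a normalized Simple Tree Matching score (the post doesn't name the measure), and it reads "tags that contain text" as "nodes with text somewhere in their subtree". `TagNode`, `normalized_similarity`, and `keep_text_bearing` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TagNode:
    """Hypothetical tag tree node: a tag name, its own text, and child nodes."""
    tag: str
    text: str = ""
    children: List["TagNode"] = field(default_factory=list)


def size(node: TagNode) -> int:
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(size(child) for child in node.children)


def simple_tree_matching(a: TagNode, b: TagNode) -> int:
    """Size of the largest order-preserving match between two trees."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            table[i][j] = max(table[i][j - 1], table[i - 1][j],
                              table[i - 1][j - 1] + match)
    return table[m][n] + 1  # +1 for the matching roots


def normalized_similarity(a: TagNode, b: TagNode) -> float:
    """Match size divided by the average size of the two trees."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)


def keep_text_bearing(node: TagNode) -> Optional[TagNode]:
    """Assumed DEPTA-style filter: drop nodes with no text in their subtree."""
    kept = [k for k in map(keep_text_bearing, node.children) if k is not None]
    if not node.text.strip() and not kept:
        return None
    return TagNode(node.tag, node.text, kept)


# Two record candidates that differ only by a text-less <img> tag.
record_a = TagNode("div", children=[TagNode("img"), TagNode("span", "A")])
record_b = TagNode("div", children=[TagNode("span", "B")])

all_tags = normalized_similarity(record_a, record_b)
text_only = normalized_similarity(keep_text_bearing(record_a),
                                  keep_text_bearing(record_b))
print(all_tags, text_only)
```

Under these assumptions, the text-less `img` tag lowers the score when all tags are counted, and has no effect when only text-bearing tags are compared.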

SDE accepts the following parameters for extracting structured data:

  1. similarity threshold: the minimum similarity score that tag trees/subtrees must reach to be treated as generalized nodes/data records
  2. maximum number of nodes in a generalized node
  3. whether to use content similarity in partial tree alignment
  4. whether to ignore formatting tags such as B, I, and U when building the tag tree
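
As a rough illustration of how these parameters could be grouped, and of what the fourth one does to the tag tree, here is a Python sketch. It is not SDE's actual API: the names `ExtractionOptions`, `TagTreeBuilder`, and `build_tag_tree` are hypothetical, the default values are made up, and the standard library `html.parser` stands in for Neko HTML Parser.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List

FORMATTING_TAGS = {"b", "i", "u"}
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # tags with no end tag


@dataclass
class ExtractionOptions:
    """Hypothetical container for the four parameters; defaults are illustrative."""
    similarity_threshold: float = 0.8
    max_nodes_in_generalized_node: int = 3
    use_content_similarity: bool = False
    ignore_formatting_tags: bool = True


@dataclass
class TagNode:
    tag: str
    text: str = ""
    children: List["TagNode"] = field(default_factory=list)


class TagTreeBuilder(HTMLParser):
    """Builds a simple tag tree, optionally skipping formatting tags."""

    def __init__(self, options: ExtractionOptions):
        super().__init__()
        self.options = options
        self.root = TagNode("#root")
        self.stack = [self.root]

    def _skipped(self, tag: str) -> bool:
        return self.options.ignore_formatting_tags and tag in FORMATTING_TAGS

    def handle_starttag(self, tag, attrs):
        if self._skipped(tag):
            return  # the tag's text will attach to the enclosing node
        node = TagNode(tag)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if self._skipped(tag) or tag in VOID_TAGS:
            return
        # Close everything up to the nearest open tag with this name.
        for depth in range(len(self.stack) - 1, 0, -1):
            if self.stack[depth].tag == tag:
                del self.stack[depth:]
                break

    def handle_data(self, data):
        self.stack[-1].text += data


def build_tag_tree(html: str, options: ExtractionOptions) -> TagNode:
    builder = TagTreeBuilder(options)
    builder.feed(html)
    builder.close()
    return builder.root


html = "<div><b>Title</b><span>x</span></div>"
with_skip = build_tag_tree(html, ExtractionOptions(ignore_formatting_tags=True))
without_skip = build_tag_tree(html, ExtractionOptions(ignore_formatting_tags=False))
print([c.tag for c in with_skip.children[0].children])
print([c.tag for c in without_skip.children[0].children])
```

In this sketch, ignoring formatting tags removes the `b` node from the tree and attaches its text to the enclosing `div`, so two records that differ only in bold or italic markup end up with the same tag structure.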