Extractors

Extractor components are responsible for parsing subdomain addresses out of any Content object.

The following extractor components are already implemented in Subscan:

  • HTMLExtractor

    Extracts subdomain addresses from the inner text of the elements selected by a given XPath or CSS selector.

  • JSONExtractor

    Extracts subdomain addresses from JSON content. A JSON parsing function must be supplied to this extractor.

  • RegexExtractor

    Generates a subdomain pattern from the given domain address and extracts the subdomains that match this pattern.
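The idea behind pattern-based extraction can be sketched without Subscan at all. The snippet below is a minimal illustration, not Subscan's implementation: it swaps the regular expression for a plain standard-library scan, and `extract_subdomains` is a hypothetical helper name. It splits the text on every character that cannot appear in a host name and keeps the tokens that end with `.<domain>`.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper (not part of Subscan): collects every host name in
/// `text` that is a subdomain of `domain`.
fn extract_subdomains(text: &str, domain: &str) -> BTreeSet<String> {
    // A subdomain must end with ".<domain>", so the bare domain is skipped.
    let suffix = format!(".{}", domain.to_ascii_lowercase());

    text
        // Split on anything that is not a letter, digit, dot or dash.
        .split(|c: char| !(c.is_ascii_alphanumeric() || c == '.' || c == '-'))
        // Drop stray dots/dashes at the edges (e.g. a sentence-ending dot)
        // and normalise case, since host names are case-insensitive.
        .map(|token| {
            token
                .trim_matches(|c: char| c == '.' || c == '-')
                .to_ascii_lowercase()
        })
        // Keep only the hosts that belong to the target domain.
        .filter(|host| host.ends_with(&suffix))
        // Reject hosts with empty labels such as "a..example.com".
        .filter(|host| host.split('.').all(|label| !label.is_empty()))
        .collect()
}
```

Calling `extract_subdomains(page_text, "example.com")` returns an ordered, de-duplicated set, which is the same shape (`BTreeSet`) that the extractor interface returns.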

Create Your Custom Extractor

Each extractor component should implement the interface below. For a better understanding, you can explore the docs.rs page and review the async_trait and enum_dispatch crates that the interface relies on.

#[async_trait]
#[enum_dispatch]
pub trait SubdomainExtractorInterface: Send + Sync {
    // Generic extract method; it should extract the subdomain addresses
    // that belong to `domain` from the given Content
    async fn extract(&self, content: Content, domain: &str) -> Result<BTreeSet<Subdomain>>;
}

Below is a simple example of a custom extractor. For more examples, you can check the examples/ folder on the project's GitHub page. You can also refer to the source code of the predefined extractor implementations for a better understanding.

pub struct CustomExtractor {}

#[async_trait]
impl SubdomainExtractorInterface for CustomExtractor {
    async fn extract(&self, content: Content, _domain: &str) -> Result<BTreeSet<Subdomain>> {
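        // Treat the whole content as a single subdomain and strip its dashes
        let subdomain = content.as_string().replace('-', "");

        // Build the result set from the single extracted value
        Ok([subdomain].into())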
    }
}
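The example above ignores the `domain` argument. As a rough sketch of how an extractor might use it, the snippet below filters line-separated content down to the hosts that belong to the target domain. It compiles without Subscan, so everything in it is a stand-in: `Content` and `Subdomain` are simplified placeholders for the real types, and `SyncExtractor` and `LineExtractor` are hypothetical names for a synchronous mirror of the interface and a sample implementation.

```rust
use std::collections::BTreeSet;

// Stand-ins for Subscan's types, assumed here only so that the sketch
// compiles on its own; the real definitions live in the subscan crate.
type Subdomain = String;

struct Content(String);

impl Content {
    fn as_string(&self) -> String {
        self.0.clone()
    }
}

// Hypothetical synchronous mirror of `SubdomainExtractorInterface`.
trait SyncExtractor {
    fn extract(&self, content: Content, domain: &str) -> BTreeSet<Subdomain>;
}

// Treats every line of the content as a candidate host and keeps only
// the ones that are subdomains of `domain`.
struct LineExtractor;

impl SyncExtractor for LineExtractor {
    fn extract(&self, content: Content, domain: &str) -> BTreeSet<Subdomain> {
        let suffix = format!(".{}", domain);
        let text = content.as_string();

        text.lines()
            // Normalise whitespace and case before comparing
            .map(|line| line.trim().to_lowercase())
            // Keep hosts ending with ".<domain>"; the bare domain is dropped
            .filter(|host| host.ends_with(&suffix))
            .collect()
    }
}
```

Returning a `BTreeSet` keeps the results sorted and free of duplicates, so the same host appearing on several lines is reported once.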