Web scraping in C# using HtmlAgilityPack

Published October 16, 2022 in .NET, .NET Core, C#

In this post, I show an example of scraping data in C# using HtmlAgilityPack. I came across HtmlAgilityPack because I needed data from Zillow to analyze property deals. I was able to scrape the data I wanted without much trouble using HtmlAgilityPack with a bit of XPath, LINQ, and regular expressions.

Below is a screenshot of a sample Zillow listing page that contains the data I want to scrape.

Sample Zillow listing detail page

I want to scrape the data under Facts and Features. The process is simple using .NET HttpClient and HtmlAgilityPack. First, I download the HTML content. Then, I use HtmlAgilityPack to parse the document and extract the data using XPath.

Stream HTML

It is quite easy to download the HTML of a Zillow listing page using .NET HttpClient, as the code snippet below shows.

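using System.Net.Http;
using System.Threading.Tasks;

public class ZillowClient : IZillowClient
{
    private readonly HttpClient _httpClient;

    public ZillowClient(HttpClient httpClient)
    {
        _httpClient = httpClient;
    }

    // Downloads the listing page for the given address as a single HTML string.
    public Task<string> GetHtml(string address)
    {
        return _httpClient.GetStringAsync(BuildUrl(ZillowUtil.NormalizeAddress(address)));
    }

    // Expects an address that has already been normalized (hyphen-separated).
    private static string BuildUrl(string address)
    {
        return $"https://www.zillow.com/homes/{address}";
    }
}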

In the code above, I use the GetStringAsync method to download the HTML content into a string in one line. Here is an example of a URL that Zillow understands: https://www.zillow.com/homes/7777-Alder-Ave-Fontana-CA-92336. To build it, I replace the commas and spaces in the address with hyphens, as shown in the code snippet below:

public static string NormalizeAddress(string address)
{
    return Regex.Replace(address.Replace(",", " "), @"\s+", " ").Replace(" ", "-");
}
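To make the transformation concrete, here is a small self-contained sketch of the same idea written as a single regular expression. The class name AddressNormalizerSketch is hypothetical, and the extra Trim() is my own assumption to guard against leading or trailing whitespace; it is not part of the original helper.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only; not part of the post's ZillowUtil class.
public static class AddressNormalizerSketch
{
    // Collapses every run of commas and whitespace into one hyphen.
    // Trim() is an added assumption so padded input does not produce edge hyphens.
    public static string Normalize(string address)
    {
        return Regex.Replace(address.Trim(), @"[\s,]+", "-");
    }
}
```

With this sketch, Normalize("7777 Alder Ave, Fontana, CA 92336") yields 7777-Alder-Ave-Fontana-CA-92336, which matches the sample URL above.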

Parsing data using HtmlAgilityPack and XPath

Once I have the HTML content, I load it into an HtmlDocument object and extract the data into a model I can use.

public ListingDetail Parse(string html)
{
    HtmlDocument htmlDoc = new HtmlDocument();
    htmlDoc.LoadHtml(html);

    var listingDetail = new ListingDetail();
    UpdateListingDetailWithFacts(htmlDoc, listingDetail);
    return listingDetail;
}

HtmlAgilityPack exposes methods to extract data using XPath. For instance, the element below contains the listing price of the property in the screenshot above.

<span data-testid="price" class="Text-c11n-8-73-0__sc-aiai24-0 dpf__sc-1me8eh6-0 kGdfMs fzJCbY">
  <span>$750,000</span>
</span>

The listing price is the text of the inner <span> element nested under the parent <span> element whose data-testid attribute is "price". The snippet below extracts the listing price from the HtmlDocument object using XPath.

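// Requires: using System.Globalization;
private decimal ParseListingPrice(HtmlDocument htmlDoc)
{
    var listingPriceElement = htmlDoc.DocumentNode.SelectSingleNode("//span[@data-testid=\"price\"]/span[1]");
    if (listingPriceElement != null)
    {
        // InnerText drops any nested markup. The invariant culture keeps "," as the
        // thousands separator regardless of the machine's regional settings.
        var listingPriceText = listingPriceElement.InnerText;
        return decimal.Parse(listingPriceText.Replace("$", "").Trim(), NumberStyles.Number, CultureInfo.InvariantCulture);
    }
    return 0;
}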
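A listing may not always show a numeric price, and decimal.Parse throws on text it cannot read. The sketch below shows one way to handle that case with decimal.TryParse, falling back to 0 just as the method above does for a missing element. PriceTextSketch and ParseOrZero are hypothetical names, and the choice of 0 as the fallback is an assumption carried over from the snippet above.

```csharp
using System.Globalization;

// Hypothetical helper for illustration only.
public static class PriceTextSketch
{
    // Parses text such as "$750,000" into a decimal.
    // Returns 0 when the text is empty or is not a number.
    public static decimal ParseOrZero(string? priceText)
    {
        if (string.IsNullOrWhiteSpace(priceText))
        {
            return 0m;
        }

        var cleaned = priceText.Replace("$", "").Trim();

        // NumberStyles.Number accepts thousands separators and a decimal point.
        return decimal.TryParse(cleaned, NumberStyles.Number, CultureInfo.InvariantCulture, out var price)
            ? price
            : 0m;
    }
}
```

With this sketch, ParseOrZero("$750,000") returns 750000, while a non-numeric string returns 0 instead of throwing.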

Besides the listing price, it's also easy to parse other info under Facts and Features, such as the number of bedrooms, the number of bathrooms, and the year built. By inspecting the HTML elements in the browser, I noticed that all the info I need is wrapped in <span> elements. Furthermore, the text of those <span> elements has this format: {label}: {data}. For example, here is the <span> element for Bedrooms:

<span class="Text-c11n-8-73-0__sc-aiai24-0 kHeRng">Bedrooms: 4</span>

To extract the number of bedrooms, one easy way is to just get all the <span> elements, use LINQ to filter the elements to get the one for Bedrooms, and extract the info using a simple regular expression.

private int GetNumOfBeds(HtmlDocument htmlDoc)
{
    // Note: SelectNodes returns null when the document has no <span> elements.
    var spanElements = htmlDoc.DocumentNode.SelectNodes("//span");
    var numOfBedsElement = spanElements.First(element => element.InnerHtml.Contains("Bedrooms: "));
    var match = Regex.Match(numOfBedsElement.InnerHtml.Replace(",", ""), @"\d+").Value;
    return int.Parse(match);
}
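Scanning every <span> with LINQ works, but the same filter can also be pushed into the XPath expression itself. The sketch below is one possible way to do that. It assumes the label sits directly in the span's own text, as in the Bedrooms example, and that the label contains no single quote. FactLookupSketch and FindFact are hypothetical names, not part of HtmlAgilityPack.

```csharp
using HtmlAgilityPack;

// Hypothetical helper for illustration only.
public static class FactLookupSketch
{
    // Returns the text after "label: " from the first <span> whose own text
    // starts with that label, or null when no such <span> exists.
    public static string? FindFact(string html, string label)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // XPath 1.0 functions: test the span's own text node so that ancestor
        // elements that merely contain the label somewhere inside are skipped.
        // Assumption: label has no single quote, which would break the expression.
        var xpath = $"//span[starts-with(normalize-space(text()), '{label}: ')]";
        var node = doc.DocumentNode.SelectSingleNode(xpath);
        if (node == null)
        {
            return null;
        }

        // Decode entities such as &amp; and drop the "label: " prefix.
        var text = HtmlEntity.DeEntitize(node.InnerText).Trim();
        return text.Substring(label.Length + 2).Trim();
    }
}
```

Matching on the span's own text narrows the search to the element that actually holds the value, at the cost of a slightly longer XPath expression.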

Since the <span> elements have the same format, I can make the code more generic and reusable using helper methods.

private void UpdateListingDetailWithFacts(HtmlDocument htmlDoc, ListingDetail listingDetail)
{
    listingDetail.ListingPrice = ParseListingPrice(htmlDoc);
    var spanElementsUnderFactsAndFeatures = htmlDoc.DocumentNode.SelectNodes("//span");

    if (spanElementsUnderFactsAndFeatures != null && spanElementsUnderFactsAndFeatures.Count > 0)
    {
        listingDetail.NumOfBedrooms = ExtractNumFromFirstNode(spanElementsUnderFactsAndFeatures,
            element => element.InnerHtml.Contains("Bedrooms: "));
        listingDetail.NumOfBathrooms = ExtractNumFromFirstNode(spanElementsUnderFactsAndFeatures,
            element => element.InnerHtml.Contains("Bathrooms: "));
        listingDetail.NumOfStories = ExtractNumFromFirstNode(spanElementsUnderFactsAndFeatures,
            element => element.InnerHtml.Contains("Stories: "));
        listingDetail.NumOfParkingSpaces = ExtractNumFromFirstNode(spanElementsUnderFactsAndFeatures,
            element => element.InnerHtml.Contains("Total spaces: "));
        listingDetail.LotSizeInSqrtFt = ExtractNumFromFirstNode(spanElementsUnderFactsAndFeatures,
            element => element.InnerHtml.Contains("Lot size: "));
        listingDetail.NumOfGarageSpaces = ExtractNumFromFirstNode(spanElementsUnderFactsAndFeatures,
            element => element.InnerHtml.Contains("Garage spaces: "));
        listingDetail.HomeType = ExtractTextFromFirstNode(spanElementsUnderFactsAndFeatures,
            element => element.InnerHtml.Contains("Home type: "))?.Replace("Home type: ", "");
        listingDetail.PropertyCondition = ExtractTextFromFirstNode(spanElementsUnderFactsAndFeatures,
            element => element.InnerHtml.Contains("Property condition: "))?.Replace("Property condition: ", "");
        listingDetail.YearBuilt = ExtractNumFromFirstNode(spanElementsUnderFactsAndFeatures,
            element => element.InnerHtml.Contains("Year built: "));
        listingDetail.HasHOA = ParseHasHOA(spanElementsUnderFactsAndFeatures);
    }
}

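private int ExtractNumFromFirstNode(HtmlNodeCollection nodeCollection, Func<HtmlNode, bool> predicate)
{
    // FirstOrDefault returns null when no node satisfies the predicate.
    var firstNode = nodeCollection.FirstOrDefault(predicate);
    if (firstNode == null)
    {
        return 0;
    }

    // Strip thousands separators, then take the first run of digits.
    var match = Regex.Match(firstNode.InnerHtml.Replace(",", ""), @"\d+");

    // Match.Value is an empty string (never null) when nothing matches,
    // so check Success before parsing to avoid a FormatException.
    return match.Success ? int.Parse(match.Value) : 0;
}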

private string? ExtractTextFromFirstNode(HtmlNodeCollection nodeCollection, Func<HtmlNode, bool> predicate)
{
    // FirstOrDefault returns null when no node satisfies the predicate.
    return nodeCollection.FirstOrDefault(predicate)?.InnerHtml;
}
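Because every fact follows the same {label}: {data} format, the label and the data can also be separated with one regular expression instead of a Contains check followed by a Replace. The sketch below shows that idea on plain strings. FactTextSketch and TrySplit are hypothetical names, and the pattern assumes the first colon in the text separates the label from the data.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration only.
public static class FactTextSketch
{
    // Everything before the first colon is the label; the rest is the data.
    private static readonly Regex FactPattern =
        new Regex(@"^\s*(?<label>[^:]+):\s*(?<value>.+?)\s*$");

    // Splits text such as "Home type: SingleFamily" into its two parts.
    // Returns false (with empty outputs) when the text has no "label: data" shape.
    public static bool TrySplit(string text, out string label, out string value)
    {
        var match = FactPattern.Match(text);
        label = match.Success ? match.Groups["label"].Value.Trim() : "";
        value = match.Success ? match.Groups["value"].Value : "";
        return match.Success;
    }
}
```

For example, TrySplit("Home type: SingleFamily", out var label, out var value) returns true with label set to "Home type" and value set to "SingleFamily". A numeric fact such as "Bedrooms: 4" comes back as the string "4", which can then be passed to int.Parse.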

Hopefully, you find this post helpful should you need to scrape data from Zillow or other sites. Happy coding.

References

Web Scraping with C# | ScrapingBee
Html Agility Pack (html-agility-pack.net)
XPath Tutorial (w3schools.com)
