C# - Web Scraping - a simple HTML Agility Pack example

Hello Devz,

Sometimes you need to copy part of a website's content into your own program. That is what web scraping is for, and HTML Agility Pack is one of the most widely used .NET libraries for it. In this tutorial, I will show you a simple HTML Agility Pack example.

Decide what content you need

Say I want a list of all the countries in the world along with their country codes. A quick search turns up a website listing them, which can then be scraped: load the web page from C#, find the elements that hold the data, and extract it.

Web scraping with this HTML Agility Pack example

HTML Agility Pack is a free, open-source HTML parser for .NET, available as the HtmlAgilityPack NuGet package. It makes it easy to pick out the nodes we want from a web page.
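
To make "getting the nodes we want" concrete, here is a minimal sketch that parses an HTML string and selects nodes with an XPath query. The markup, the `NodeSelectionSketch` class, and the `SelectCellTexts` helper are made up for illustration; only `HtmlDocument.LoadHtml`, `DocumentNode.SelectNodes`, and `InnerText` come from the library.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class NodeSelectionSketch
{
    // Hypothetical helper: returns the trimmed text of every <td class="border1"> cell.
    public static List<string> SelectCellTexts(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so fall back to an empty sequence.
        var cells = doc.DocumentNode.SelectNodes("//td[@class='border1']")
                    ?? Enumerable.Empty<HtmlNode>();

        return cells.Select(c => c.InnerText.Trim()).ToList();
    }

    static void Main()
    {
        // Made-up markup standing in for a real page.
        const string html =
            "<table><tr><td class='border1'>France</td><td class='other'>ignored</td></tr></table>";

        foreach (var text in SelectCellTexts(html))
        {
            Console.WriteLine(text);
        }
    }
}
```

Loading from a string keeps the sketch independent of any live website; `HtmlWeb.Load(url)` would return the same kind of `HtmlDocument`.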

The code below shows how to use HTML Agility Pack to get the country names and codes:

using HtmlAgilityPack;
using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

namespace WebScraper
{
    class Program
    {
        static void Main(string[] args)
        {
            WebDataScrap();
        }

        public static void WebDataScrap()
        {
            try
            {
                //Get the content of the URL from the Web
                const string url = "http://www.nationsonline.org/oneworld/country_code_list.htm";
                var web = new HtmlWeb();
                var doc = web.Load(url);

                //Get the content from a file
                //var path = "countries.html";
                //var doc = new HtmlDocument();
                //doc.Load(path);

                //Filter the content
                doc.DocumentNode.Descendants()
                                .Where(n => n.Name == "script")
                                .ToList()
                                .ForEach(n => n.Remove());

                const string classValue = "border1";
                var nodes = doc.DocumentNode.SelectNodes($"//*[@class='{classValue}']") ?? Enumerable.Empty<HtmlNode>();

                //Write the desired content to a file
                using (var file = new StreamWriter("test.txt"))
                {
                    foreach (var node in nodes)
                    {
                        //Split the node text into lines, dropping blank lines and &nbsp; fillers
                        var lines = Regex.Split(node.InnerText, "\n");
                        var words = lines
                            .Where(x => !x.Contains("&nbsp;") && !string.IsNullOrWhiteSpace(x))
                            .ToList();

                        //Skip anything that is not a complete four-column row
                        if (words.Count != 4) continue;

                        var countryName = words[0].Trim();
                        var countryCode = words[2].Trim();
                        var result = $"{countryName};{countryCode}";

                        file.WriteLine(result);
                        Console.WriteLine(result);
                    }
                }

                Console.WriteLine("\r\nPlease press a key...");
                Console.ReadKey();
            }
            catch (Exception ex)
            {
                Console.WriteLine($"An error occurred:\r\n{ex.Message}");
            }
        }
    }
}
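
The row-parsing step in the listing above can be tried on its own, without any network access. The sketch below pulls it into a hypothetical `TryParseRow` helper (the name and the sample text are my own, not part of the original code) and applies the same rule: keep the non-blank lines that do not contain `&nbsp;`, require exactly four of them, and join the first and third.

```csharp
using System;
using System.Linq;

static class RowParser
{
    // Hypothetical helper mirroring the loop body of the listing above.
    // Returns true and sets result to "name;code" only for a four-line row.
    public static bool TryParseRow(string innerText, out string result)
    {
        result = string.Empty;

        var words = innerText
            .Split('\n')
            .Where(x => !x.Contains("&nbsp;") && !string.IsNullOrWhiteSpace(x))
            .Select(x => x.Trim())
            .ToList();

        if (words.Count != 4) return false;

        result = $"{words[0]};{words[2]}";
        return true;
    }
}

static class RowParserDemo
{
    static void Main()
    {
        // Made-up inner text for one row; the real page may be laid out differently.
        const string sample = "\n&nbsp;\nAfghanistan\nAF\nAFG\n004\n";

        Console.WriteLine(RowParser.TryParseRow(sample, out var row) ? row : "(row skipped)");
    }
}
```

Keeping this step separate from the download makes it easy to check the filtering rule against a few saved rows before running the scraper against the live site.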

Note about CSS classes

Of course, how you extract content depends on the page itself. This code is not generic: it relies on the CSS class names the target page uses (here, border1), so it has to be adapted for other pages and will stop working if the site changes its markup.
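
One detail worth knowing: an XPath test such as `@class='border1'` compares the whole attribute value, so an element with `class="border1 wide"` would not match. A common XPath 1.0 workaround is to pad the class list with spaces and look for the class as a whole token. The sketch below assumes made-up markup, and the `ClassMatchSketch` class and `TextsWithClass` helper are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

static class ClassMatchSketch
{
    // Hypothetical helper: text of every element whose class list contains className.
    // className is inserted into the query as-is, so pass only trusted values.
    public static List<string> TextsWithClass(string html, string className)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Surround the normalized class list with spaces so that "border1"
        // matches as a whole token and does not match "border10".
        var xpath =
            $"//*[contains(concat(' ', normalize-space(@class), ' '), ' {className} ')]";

        var nodes = doc.DocumentNode.SelectNodes(xpath) ?? Enumerable.Empty<HtmlNode>();
        return nodes.Select(n => n.InnerText.Trim()).ToList();
    }

    static void Main()
    {
        // Made-up markup: two spans carry the class, one only resembles it.
        const string html =
            "<span class='border1 wide'>kept</span>" +
            "<span class='border10'>skipped</span>" +
            "<span class='border1'>kept too</span>";

        foreach (var text in TextsWithClass(html, "border1"))
        {
            Console.WriteLine(text);
        }
    }
}
```

Whether this is needed depends on the target page: if every cell carries exactly one class, the simpler equality test from the main listing is enough.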

Happy web scraping! 🙂
