Hello Devz,
Sometimes it is useful to copy part of the content of a website. That’s where web scraping comes in, and HTML Agility Pack is one of the best tools for the job in .NET. In this tutorial, I will show you a simple HTML Agility Pack example.
Decide what content you need
Say I wanted a list of all the countries in the world along with their country codes. A quick search turns up a website listing them, and that page can be scraped for the content: load the web page in C#, find the nodes that hold the data and extract it.
Web scraping with this HTML Agility Pack example
HTML Agility Pack is a free and open source library (available as the HtmlAgilityPack NuGet package) that makes it easy to pick the nodes we want out of a web page.
The code below uses it to get the country names and codes:
    using HtmlAgilityPack;
    using System;
    using System.IO;
    using System.Linq;
    using System.Text.RegularExpressions;

    namespace WebScraper
    {
        class Program
        {
            static void Main(string[] args)
            {
                WebDataScrap();
            }

            public static void WebDataScrap()
            {
                try
                {
                    // Get the content of the URL from the web
                    const string url = "http://www.nationsonline.org/oneworld/country_code_list.htm";
                    var web = new HtmlWeb();
                    var doc = web.Load(url);

                    // Alternatively, get the content from a file
                    //var path = "countries.html";
                    //var doc = new HtmlDocument();
                    //doc.Load(path);

                    // Filter the content: remove every <script> node
                    doc.DocumentNode.Descendants()
                        .Where(n => n.Name == "script")
                        .ToList()
                        .ForEach(n => n.Remove());

                    // SelectNodes returns null when nothing matches, hence the fallback
                    const string classValue = "border1";
                    var nodes = doc.DocumentNode.SelectNodes($"//*[@class='{classValue}']")
                                ?? Enumerable.Empty<HtmlNode>();

                    // Write the desired content to a file
                    using (var file = new StreamWriter("test.txt"))
                    {
                        foreach (var node in nodes)
                        {
                            // Split the row text into lines, dropping empty lines and
                            // lines holding an &nbsp; entity (InnerText does not decode
                            // entities; testing for a plain space would also drop
                            // country names made of several words)
                            var splittedWords = Regex.Split(node.InnerText, "\n");
                            var words = splittedWords
                                .Where(x => !x.Contains("&nbsp;") && !string.IsNullOrEmpty(x.Trim()))
                                .ToList();

                            // A data row is expected to hold exactly four values
                            if (words.Count != 4) continue;

                            var countryName = words[0].Trim();
                            var countryCode = words[2].Trim();
                            var result = $"{countryName};{countryCode}";

                            file.WriteLine(result);
                            Console.WriteLine(result);
                        }
                    }

                    Console.WriteLine("\r\nPlease press a key...");
                    Console.ReadKey();
                }
                catch (Exception ex)
                {
                    Console.WriteLine($"An error occurred:\r\n{ex.Message}");
                }
            }
        }
    }
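The text handling inside the loop is the part most likely to break when the page changes, so it can help to pull it out into a small function that can be tried without any network access. Here is a minimal sketch of that idea. CountryRowParser and TryParseRow are names I made up for illustration, and the sketch assumes that blank cells appear in InnerText as a literal &nbsp; entity and that a data row yields exactly four values, with the name first and the wanted code third.

```csharp
using System;
using System.Linq;

public static class CountryRowParser
{
    // Hypothetical helper: turns the InnerText of one table row into "name;code".
    // Returns false when the row does not hold exactly four usable values.
    public static bool TryParseRow(string innerText, out string result)
    {
        result = string.Empty;

        // One value per line: trim each line, then drop empty lines and
        // lines that contain an undecoded &nbsp; entity.
        var words = innerText
            .Split('\n')
            .Select(x => x.Trim())
            .Where(x => x.Length > 0 && !x.Contains("&nbsp;"))
            .ToList();

        if (words.Count != 4) return false;

        // Assumption: first value is the country name, third is the code.
        result = $"{words[0]};{words[2]}";
        return true;
    }
}
```

With a made-up sample row such as "\n&nbsp;\nFrance\nFR\nFRA\n250\n", TryParseRow returns true and produces France;FRA. A header row or a row with a different number of cells makes it return false, which matches the continue in the loop.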
Note about CSS classes
Of course, how you get the content of a web page depends on the page itself. This code can’t be generic: it relies on the CSS class names the page uses (here border1), so it has to be adapted to each site.
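One thing to watch for: an XPath test like @class='border1' only matches when the class attribute is exactly that string. If the page puts several classes on an element, a common XPath 1.0 idiom is to match the class name as a whole token instead. The sketch below shows this with HTML Agility Pack on an inline HTML string. ClassSelector, ByClass and CountMatches are hypothetical names of mine, not part of the library, and the code assumes the HtmlAgilityPack package is referenced.

```csharp
using System;
using HtmlAgilityPack;

public static class ClassSelector
{
    // Hypothetical helper: builds an XPath that matches any element whose
    // class attribute contains className as a whole, space-separated token.
    public static string ByClass(string className) =>
        $"//*[contains(concat(' ', normalize-space(@class), ' '), ' {className} ')]";

    // Counts the elements in the given HTML that carry the class.
    public static int CountMatches(string html, string className)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes(ByClass(className));
        return nodes?.Count ?? 0;
    }
}
```

For the input <div class="border1 odd"></div><div class="border10"></div>, CountMatches with "border1" should count only the first element, whereas the exact comparison @class='border1' would match neither.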
Happy web scraping! 🙂