“Maybe we can use data from the Internet?”
Have you ever wondered that? In my recent experience, this question is coming up more and more. After all, the Internet holds so much incredible information; if only it could be downloaded and processed, just think how valuable it could be.
In this blog, I’ll discuss multiple web data mining use cases that support business intelligence and analytics. Let’s first have a high-level look at some business needs for extracting web data and how to identify the right data for your requirements.
Why use web data mining for business intelligence?
A fast-growing field, web data mining can provide business intelligence to help drive sales, understand customers, meet mission goals, and create new business opportunities. At Accenture, we help clients mine data from the Internet for a wide variety of use cases. Here are some examples:
Learn more about your customer:
- What is the CEO of your customer’s company saying?
- What is your customer’s financial situation, and what are their key initiatives?
- What are your customers tweeting and posting about recently?
Learn more about your competitors:
- What are your competitors doing?
- What are they selling?
- Are they doing anything new? Unique?
Find new customers and sales targets:
- What’s happening in the world?
- Where should you target your sales?
Learn more about the government:
- What rules and regulations affect your company?
- What is the government thinking about doing that might affect your business?
- What are available grants and business opportunities?
Find things that are being sold and the people who sell them:
- To compare prices
- To look for new business opportunities
- To look for illegal activities and items that should not be sold
Supplement your internal offerings with external content:
- So users can “stay inside” your offering without having to consult external databases
Translate between external language and internal language:
- Often, the words and phrases your community uses differ from the ones used inside your company
- Consulting external sources can help “translate” between the two
Monitor what people are saying about your business:
- Identify and mitigate potential customer issues before they go viral
- Track the effectiveness of your ad campaigns
- Track product and brand activity and sentiment
Web data mining applications
Some example applications in the enterprise include:
- Scan through news articles to identify other companies' strategic plans
- Use 10-K documents, shareholder meetings, and annual reports to identify key company initiatives
- Search and flag illegal or suspicious activities
- Gather mining jobs and mining articles for industry-focused web search applications
- Automatically find and analyze government rules and regulations
- Automatically find and analyze speeches and public statements
- Mine the web for specific industries' rules and regulations
- Identify and track conferences with locations and organizations
Useful data sources for your web data mining project
For most of us, it’s impractical to download all the data on the web. Therefore, you must first identify the data sources you want to target. Sources, of course, vary widely in quality, volume, applicability, and accessibility.
- Curated public sources: Wikipedia (available in convenient XML dump files), Wikidata, and Wiktionary
- Social media sources: Twitter, Facebook, Reddit, Instagram, Pinterest, Google+
- Government data: U.S. Government Publishing Office, United States Code, Data.gov
- Medical and health: Medline, MeSH, CPT and ICD codes
- Company content: company websites can be crawled with web crawlers (Wikidata is a good starting point to find website addresses); financial performance data can also be found on AnnualReports.com and EDGAR
- Third-party aggregators: Thomson Reuters, Factiva, NewsCred, and LexisNexis. These services provide data for a fee, and all offer APIs for searching, filtering, and downloading content. Their available data includes news stories (from large and small news organizations around the world), company reports, annual reports, financial filings, worldwide patents, marketing and market reports, corporate communications, and more.
- Niche websites: Stack Exchange (e.g., Stack Overflow, which offers data dumps and an API), GitHub, and others; coding sites like these are often good starting points for content analysis
- The World Wide Web:
- You can manually identify web pages (“seed URLs”) for a crawler to crawl
- You can get a set of websites from a search engine, for example, Bing or Google Custom Search (note that there is a cost for more than 100 or so searches per day). The websites returned by these search engines can then be crawled with a web crawler
- You can also get seed URLs from other data sets, such as Wikidata, Twitter, and Reddit
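To make the last point concrete, here is a minimal sketch of pulling seed URLs from Wikidata using its public SPARQL endpoint. The Wikidata identifiers used (P31 “instance of,” P279 “subclass of,” P856 “official website,” Q4830453 “business”) are real, but the query itself is just one illustrative way to ask for company websites; the `fetch_seed_urls` and `extract_websites` helpers are names I’ve made up for this sketch.

```python
import json
import urllib.parse
import urllib.request

# SPARQL query: entities that are (sub)instances of "business" (Q4830453),
# together with their official website (property P856).
QUERY = """
SELECT ?company ?companyLabel ?website WHERE {
  ?company wdt:P31/wdt:P279* wd:Q4830453 .
  ?company wdt:P856 ?website .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 25
"""

ENDPOINT = "https://query.wikidata.org/sparql"

def extract_websites(sparql_json: dict) -> list:
    """Pull the ?website values out of a SPARQL JSON result."""
    return [row["website"]["value"]
            for row in sparql_json["results"]["bindings"]
            if "website" in row]

def fetch_seed_urls(query: str = QUERY) -> list:
    """Run the query against the public endpoint; return website URLs."""
    url = ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": query, "format": "json"})
    req = urllib.request.Request(
        url, headers={"User-Agent": "seed-url-demo/0.1"})
    with urllib.request.urlopen(req) as resp:
        return extract_websites(json.load(resp))

# Usage (makes a live request to query.wikidata.org):
#   for site in fetch_seed_urls():
#       print(site)
```

The returned URLs can then be fed directly into a crawler’s frontier as seed URLs.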
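And once you have seed URLs, the core of a crawler is just fetching a page, extracting its links, and enqueuing any new ones. Here is a minimal, stdlib-only sketch of the link-extraction step (the names `LinkExtractor` and `extract_links` are mine, not a library API); a production crawler would also need robots.txt handling, rate limiting, and deduplication.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute link targets from <a href="..."> tags."""
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html: str, base_url: str) -> list:
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

page = ('<html><body><a href="/about">About</a> '
        '<a href="https://example.org/news">News</a></body></html>')
print(extract_links(page, "https://example.com/"))
# → ['https://example.com/about', 'https://example.org/news']
```

A crawl frontier is then a queue seeded with your start URLs: pop a URL, fetch it, extract its links, and push any URLs you haven’t visited yet.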
Once you’ve identified the sources you need data from, the next step is to acquire the content effectively using available data mining tools and techniques. I’ll discuss this step in my next blog. Stay tuned!