Data Big Bang Blog

Creativity and Problem Solving for Data Science (whatever it may mean...) | An experimental spin-off from Nektra Advanced Computing

Latest articles

The Call of the Web Scraper

Astrid, our Data Big Bang and Nektra content editor, is heading to Nepal on a birding and trekking quest. She needs bird sounds from xeno-canto and The Internet Bird Collection to identify the hundreds of species found in Nepal, but the sites do not offer batch downloads. We could not pass up the opportunity to offer a useful scraper for birders....

Web Scraping 101: Pulling Stories from Hacker News

This is a guest post by Hartley Brody, whose book “The Ultimate Guide to Web Scraping” goes into much more detail on web scraping best practices. You can follow him on Twitter; it’ll make his day! Thanks for contributing, Hartley! Hacker News is a treasure trove of information on the hacker zeitgeist. There are all sorts of cool things you could do with...

Scraping Web Sites which Dynamically Load Data

Preface: More and more sites are implementing dynamic updates of their contents. New items are added as the user scrolls down. Twitter is one of these sites. Twitter only displays a certain number of news items initially, loading additional ones on demand. How can sites with this behavior be scraped? In the previous article we played with Google Chrome...

Precise Scraping with Google Chrome

Developers often search the vast corpus of scraping tools for one that is capable of simulating a full browser. Their search is pointless. Full browsers with extension capabilities are great scraping tools. Among extensions, Google Chrome’s are by far the easiest to develop, while Mozilla has less restrictive APIs. Google offers a second way to control...

Scraping for Semi-automatic Market Research

It is easy to scrape Microsoft TechNet Forums and normalize the resulting information to have a better idea of each thread’s rank based on views and initial publication date. Knowing how issues are ranked can help a company choose what to focus on. This code can be used to scrape any of Microsoft TechNet’s forums. In the example below we scraped the...

Letters from the Future: Challenging Google’s Search Engine

A previous version of this article was posted on Duck Duck Go reddit, where the user _zekiel pointed out that DDG currently uses the two-level search we had proposed. Google is the undisputed search leader (88% market share in the US). Google is not only ahead of competitors in terms of quality of search results, infrastructure worthy of science fiction,...

Parsing S-Expressions in C# using OMeta

It is easy to parse S-Expressions in C# with OMeta. Our code limits the grammar to lists, and atoms of string, symbol, and number types. So, it is not complete, but it can easily be expanded with OMeta. What motivated me to write this article was the lack of publicly available S-Expression parsers in C#/.NET. Our parser converts the expression (+ (*...

Searching for Substrings in Streams: a Slight Modification of the Knuth-Morris-Pratt Algorithm in Haxe

It is odd that the base libraries for most programming languages do not allow you to search for regular expressions and substrings in streams or partial reads. We have modified the KMP algorithm so that it accepts virtually infinite partial strings. The code is implemented in Haxe, so it can generate code in multiple programming languages. Streams are...
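The idea in this teaser can be illustrated with a minimal Python sketch. It is not the article's Haxe code: the names `failure_table`, `StreamSearcher`, and `feed` are hypothetical, and the approach shown is simply standard KMP with the match state kept between calls, which is one plausible way to handle partial reads.

```python
from typing import List


def failure_table(pattern: str) -> List[int]:
    """Standard KMP failure function: fail[i] is the length of the longest
    proper prefix of pattern[:i + 1] that is also a suffix of it."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


class StreamSearcher:
    """Hypothetical helper: searches for a pattern across successive chunks.

    Only the number of pattern characters currently matched and the total
    number of characters consumed are kept, so earlier chunks can be dropped.
    """

    def __init__(self, pattern: str) -> None:
        if not pattern:
            raise ValueError("pattern must not be empty")
        self.pattern = pattern
        self.fail = failure_table(pattern)
        self.matched = 0   # pattern characters matched so far
        self.consumed = 0  # stream characters seen so far

    def feed(self, chunk: str) -> List[int]:
        """Consume one chunk and return the absolute start offsets of every
        match that ends inside this chunk."""
        hits = []
        for ch in chunk:
            # Fall back through the failure table on a mismatch.
            while self.matched > 0 and ch != self.pattern[self.matched]:
                self.matched = self.fail[self.matched - 1]
            if ch == self.pattern[self.matched]:
                self.matched += 1
            self.consumed += 1
            if self.matched == len(self.pattern):
                hits.append(self.consumed - len(self.pattern))
                # Keep going so overlapping matches are also reported.
                self.matched = self.fail[self.matched - 1]
        return hits


searcher = StreamSearcher("abab")
first = searcher.feed("xxab")   # the match starts here but is not complete
second = searcher.feed("abab")  # matches that span the chunk boundary
```

Because the searcher never looks back at earlier input, memory use depends on the pattern length only, not on how much of the stream has been read.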

Enriching a List of URLs with Google Page Rank

Dealing with a large body of web resources can be daunting. You make a list of hundreds of blogs, but how do you share or recall those resources later? You must somehow organize your list. Many people do this with tags, but this is not necessarily the best option. Manual organization is also tedious, so tools for enriching data automatically came in...

Esoteric Queue Scheduling Disciplines

New Challenges Require New Tools: Big Data challenges current message-oriented middleware (MOM) applications. MOM usually works with FIFO and priority scheduling disciplines. What happens if there is a large list of URLs ready to be crawled but you want to give URLs at the end of the list a chance of being crawled earlier? This concept comes from genetics,...
