Tuesday, May 18, 2010

jr. webcrawling

Yesterday I was working on a project to retrieve the 1st- and 2nd-degree Twitter followers for an unconference before a list had been built. The Twitter handles were listed on individual WordPress pages under a single directory, so I used a two-step process to extract them.

1. I used an old Java tool called WebSPHINX, which let me crawl the directory of the site I was looking at and concatenate all of its pages into one massive page.

2. I posted that page in the sandbox of my site and directed Dapper to it. From there, I was able to create a Dapp identifying the fields I wanted, group them together, and create a CSV document to put into Excel.
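For anyone who wants to try the same idea without those tools, here is a rough Python sketch of the two steps: join the crawled pages into one blob, pull out the Twitter handles, and write them to CSV. This is not the WebSPHINX/Dapper setup described above. The function names are made up for illustration, the crawl itself is left out (the pages are assumed to already be in memory as HTML strings), and the two regular expressions are only a guess at how handles might appear on a page (as twitter.com links or as @mentions).

```python
import csv
import io
import re

# Assumption: handles show up either as links to twitter.com/<handle>
# or as plain @handle text. Real pages may need different patterns, and
# non-profile links such as twitter.com/share would be picked up too.
HANDLE_LINK = re.compile(
    r"https?://(?:www\.)?twitter\.com/(?:#!/)?([A-Za-z0-9_]{1,15})\b",
    re.IGNORECASE,
)
# The lookbehind skips e-mail addresses such as carol@example.com.
HANDLE_MENTION = re.compile(r"(?<![A-Za-z0-9_@.])@([A-Za-z0-9_]{1,15})\b")


def concatenate_pages(pages):
    """Step 1 (hypothetical helper): join crawled pages into one document."""
    return "\n".join(pages)


def extract_handles(html):
    """Step 2 (hypothetical helper): collect handles, deduplicated case-insensitively."""
    seen = set()
    handles = []
    for pattern in (HANDLE_LINK, HANDLE_MENTION):
        for match in pattern.finditer(html):
            handle = match.group(1)
            key = handle.lower()
            if key not in seen:
                seen.add(key)
                handles.append(handle)
    return handles


def handles_to_csv(handles):
    """Render the handles as one-column CSV text that Excel can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["handle"])
    for handle in handles:
        writer.writerow([handle])
    return buffer.getvalue()


if __name__ == "__main__":
    # Made-up sample pages standing in for the crawled WordPress pages.
    sample_pages = [
        '<a href="https://twitter.com/alice">Alice</a>',
        "<p>Follow @bob_2 and @Alice</p>",
    ]
    print(handles_to_csv(extract_handles(concatenate_pages(sample_pages))))
```

Regex matching on raw HTML is crude compared with picking fields visually, so expect to clean up the output by hand.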

This was my first time playing with Dapper, and I can definitely see a lot of great uses for it!

1 Comments:

Blogger jonvoss said...

Adding a possible update: instead of this whole process, or at least instead of Dapper, check out https://import.io/

April 8, 2014 at 9:42 PM  

Copyright 2006 jumpSLIDE networks. All rights reserved.