jr. webcrawling
Yesterday I was working on a project to retrieve 1st- and 2nd-degree Twitter followers for an unconference before a list had been built. The Twitter handles were listed on individual WordPress pages under a single directory, so I used a two-step process to extract them.
1. I used an old Java tool called WebSPHINX, which let me crawl the directory of the site I was looking at and concatenate all of the pages into one massive page.
2. I posted that page in the sandbox of my site and directed Dapper to it. From there, I was able to create a Dapp identifying the fields I wanted, group them together, and create a CSV document to put into Excel.
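For anyone who wants to try the same two steps without WebSPHINX or Dapper, here is a rough Python sketch of the idea. It is not what either tool does internally. It assumes the pages have already been downloaded as HTML strings and that each handle appears as a link to twitter.com/<handle>; the function names and the regular expression are my own illustration, and the pattern will also pick up non-profile links such as twitter.com/share, so the output needs a quick check by eye.

```python
import csv
import io
import re

# Assumption: handles appear in the HTML as links like http://twitter.com/handle
# (optionally with "www." or the old "#!/" prefix). Twitter handles are at most
# 15 characters of letters, digits, and underscores.
HANDLE_LINK = re.compile(
    r"https?://(?:www\.)?twitter\.com/(?:#!/)?@?([A-Za-z0-9_]{1,15})(?![A-Za-z0-9_])",
    re.IGNORECASE,
)


def concatenate_pages(pages):
    """Step 1: join the crawled pages into one big document."""
    return "\n".join(pages)


def extract_handles(html):
    """Step 2a: pull out unique handles, keeping first-seen order."""
    seen = set()
    handles = []
    for match in HANDLE_LINK.finditer(html):
        handle = match.group(1)
        key = handle.lower()  # handles are case-insensitive
        if key not in seen:
            seen.add(key)
            handles.append(handle)
    return handles


def handles_to_csv(handles):
    """Step 2b: build a one-column CSV that Excel can open."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["twitter_handle"])
    for handle in handles:
        writer.writerow([handle])
    return buffer.getvalue()
```

To use it, pass the list of page sources to concatenate_pages, run extract_handles on the result, and save the string from handles_to_csv to a .csv file.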
This was my first time playing with Dapper, and I can definitely see a lot of great uses for it!
Labels: networkmapping experiments