Posted September 12, 2012
Edit: better version
Getting unique username list for a thread in Linux
for i in {1..$maxPage}; do curl -s $threadUrl/page$i; done | grep "<div class=\"small_user_name\">" | sed 's/^[ \t]*<div class="small_user_name">//' | sed 's/<\/div>//' | sort | uniq
Replace:
$maxPage with the last page number (type the number in directly; bash does not expand variables inside {1..N})
$threadUrl with the URL of the thread
Example: for i in {1..33}; do curl -s http://www.gog.com/en/forum/general/introducing_the_beta_release_of_the_new_gogcom_downloader/page$i; done | grep "<div class=\"small_user_name\">" | sed 's/^[ \t]*<div class="small_user_name">//' | sed 's/<\/div>//' | sort | uniq
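If you would rather keep the page count in a real variable, the loop needs a small change, because bash does brace expansion before variable expansion and {1..$maxPage} will not count. A rough sketch using seq instead; page_urls is just a name made up for this example, and it only prints the URLs so you can check them before fetching anything:

```shell
#!/usr/bin/env bash
# Print one URL per page of a thread, pages 1 through maxPage.
# page_urls is a hypothetical helper name, not part of the one-liner above.
page_urls() {
    local threadUrl=$1 maxPage=$2 i
    # seq expands the variable at run time, unlike {1..$maxPage}
    for i in $(seq 1 "$maxPage"); do
        printf '%s/page%s\n' "$threadUrl" "$i"
    done
}
```

Something like page_urls "$threadUrl" "$maxPage" | xargs -n 1 curl -s should then fetch the pages one by one, and the rest of the pipeline stays the same.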
Pseudocode:
1. Get me all pages of the thread
2. Get me all the users in those pages
3. Make it pretty (no HTML garbage)
4. Sort and remove multiple entries
Explanation:
1. for i in {1..$maxPage}; do curl -s $threadUrl/page$i; done
Tells curl to fetch pages 1 through $maxPage of the thread identified by $threadUrl
2. | grep "<div class=\"small_user_name\">"
Throws away every line that doesn't include <div class="small_user_name"> (thanks to hedwards for pointing it out)
3. | sed 's/^[ \t]*<div class="small_user_name">//' | sed 's/<\/div>//'
Cuts away unnecessary text in front of and after the username
4. | sort | uniq
Sorts the usernames alphabetically and removes duplicate entries
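Steps 2 to 4 can also be wrapped in a small function that reads HTML on stdin, which makes it easy to try on a saved page without hitting the forum. This is only a sketch: extract_usernames is a made-up name, it assumes each username sits on its own line inside a small_user_name div as described above, and it swaps sort | uniq for the equivalent sort -u.

```shell
#!/usr/bin/env bash
# Read forum HTML on stdin and print each username once, sorted.
# extract_usernames is a hypothetical name; the markup is assumed to match
# the small_user_name lines the one-liner greps for.
extract_usernames() {
    # keep only the lines that carry a username
    grep '<div class="small_user_name">' |
        # strip leading whitespace plus the opening tag, then the closing tag
        sed -e 's/^[[:space:]]*<div class="small_user_name">//' -e 's/<\/div>//' |
        # sort and drop duplicates in one step
        sort -u
}
```

To use it on a live thread, pipe the curl loop into it, for example: for i in {1..33}; do curl -s "$threadUrl/page$i"; done | extract_usernames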
Post edited September 12, 2012 by fengor