Instructions for using the pageranks:
I have put sample code in the webbase directory. Take a look at getpagerank.c. It will work on lilblue since the pageranks are there. Note the swapshort conversion, which converts little-endian to big-endian. If you run this on a PC, don't use swapshort.
Also: I have made the data link in the webbase dir automatically point to the right place depending on what machine you are using.
--Sergey
Here's the deal:
-- no group has a quota, except for the actual disk space available. If you want to know how much disk space is left, type "df ." at the command line ("du ." shows how much you are using).
-- there is a file size limitation on lilblue. No one file can be bigger than 1 GB. However, you can have multiple files, each smaller than 1 GB (within the constraint of the disk space available).
I'll look into freeing up more space on the disks that the group home directories are on, but you should also be conscious of other groups vying for the same resources, and remove any files that you no longer need.
--D
------------------------------------------------
Tuesday, December 1
4:15: asdf
4:30: cmc
4:45: learn
5:00: scriptfinder
5:15: vamp
5:30: vls
5:45: webtek
Thursday, December 3
4:15: arcmakers
4:30: baconbar
4:45: cat5
5:00: etaoin
5:15: lkr
5:30: tigers
5:45: xyyy
6:00: sbs
Hi class,
I have received several questions about process "sleeping". This is normal behavior when process reaches the end of the webbase. This is because it is designed to keep running while the webbase is growing.
I'll try to put in a patch to turn that feature off.
Regards, --Sergey
My apologies for various system problems today. All kinds of things went wrong.
lilblue should be back up and accepting ssh and telnet (as a backup).
The search server should be back up and running on palo:3491
Various groups have been asking us for various additional resources. Unfortunately it is difficult for us to meet a lot of these requests but we will do our best so feel free to ask. Just don't expect a quick response.
Regards, --Sergey
I have received lots of notes about problems with trout and lilblue. I think they are simply not handling all the load well.
I have just made "palo.stanford.edu" available. It should be quite fast and has all the data available. Note it has the same endianness as trout, so be careful if reusing code from lilblue.
Your normal login+pass should work. Also there is a dir /disk/a that has a dir for each group to put data. Note that there are only 4 GB or so available free there total for all groups.
Soon, I will make alto available as well.
Let me know if you continue to have machine problems. Also, be sure to back up your source code to another account as trout is not getting backed up.
--Sergey
Remember, use data dir /home/webbasedata. Also note that palo, alto, and trout use Intel byte order AKA vax byte order whereas lilblue uses network byte order.
--Sergey
Note: /home/webbasedata should now be your data dir on all machines since that is a soft link that points to the right place.
--Sergey
------ Forwarded Message
Return-Path: kenta...
Delivery-Date: Fri Nov 27 15:11:57 1998
Received: from elaine14.Stanford.EDU (elaine14.Stanford.EDU [171.64.15.79]) by DB.Stanford.EDU (8.8.8/8.8.8) with ESMTP id PAA09602 for <sergey...>; Fri, 27 Nov 1998 15:11:56 -0800
Received: (from kentalocalhost) by elaine14.Stanford.EDU (8.8.8/8.8.7) id PAA29158; Fri, 27 Nov 1998 15:11:56 -0800 (PST)
Date: Fri, 27 Nov 1998 15:11:56 -0800 (PST)
From: Ken Taiyo Takusagawa <kenta...>
Sender: kenta...
To: "'Sergey Brin" <sergey...>
Subject: Re: [cs349] Access to alto now available
In-Reply-To: <199811272301.PAA02746...>
Message-ID: <Pine.GSO.3.96.981127150722.29048A-100000...>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
I FTPed the repository.* files over to the leland machines: elaine10..elaine22 in the dir /tmp/repnon. (someone erased the contents of elaine13)
I've also found that Perl is quite fast. The following script (on the 360 MHz elaines) can do 1300 docs/sec.
Ken
#! perl -w
use Compress::Zlib;
$count=0;
while(1) {
  $count++;
  if ($count %1000==0) {print "$count\n";}
  read STDIN,$s,4;
  while (!($s eq "\xb8\xd9\x01\x00")) {
    die "$count" unless read STDIN,$f,1;
    $s=substr($s,-3).$f;
    print STDERR ".";
  }
  read STDIN,$s,4;
  $len=unpack("V",$s);
  # print "$len\n";
  read STDIN,$build,$len;
  $x=inflateInit();
  ($output, $status) = $x->inflate($build) ;
  # print $output if $status == Z_OK or $status == Z_STREAM_END ;
}
#last if $status!= Z_OK ;
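The same record scan can be sketched in C. This is only an illustration: the record layout (a 4-byte marker b8 d9 01 00, then a 4-byte little-endian length, then that many compressed bytes) is inferred from the Perl script above and is not a documented format. The names MAGIC, read_le32 and next_record are hypothetical, the sketch works on a buffer already in memory, and it leaves out decompression, which would need zlib.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Marker the Perl script resynchronizes on (assumed to start each record). */
static const unsigned char MAGIC[4] = {0xb8, 0xd9, 0x01, 0x00};

/* Decode a 32-bit little-endian integer, like Perl's unpack("V", ...). */
static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Find the next record at or after *pos. On success, return 1, store the
 * offset and length of the compressed payload, and advance *pos past it.
 * Return 0 when no complete record remains in the buffer.
 */
static int next_record(const unsigned char *buf, size_t n, size_t *pos,
                       size_t *payload_off, uint32_t *payload_len)
{
    assert(buf != NULL && pos != NULL);
    assert(payload_off != NULL && payload_len != NULL);

    size_t i = *pos;
    while (i + 8 <= n) {
        if (memcmp(buf + i, MAGIC, 4) != 0) {
            i++;                       /* skip a junk byte and keep looking */
            continue;
        }
        uint32_t len = read_le32(buf + i + 4);
        if (len > n - (i + 8))
            return 0;                  /* payload is cut off */
        *payload_off = i + 8;
        *payload_len = len;
        *pos = i + 8 + len;
        return 1;
    }
    return 0;
}
```

Calling next_record in a loop with *pos starting at 0 visits each record in turn; each payload could then be handed to a decompressor.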
If you are not planning on doing your presentation in HTML, please email me what format your presentation will be in, and preferably, where I can download it so that I can get your presentation onto my laptop before class so that we don't have to spend time during class trying to download it.
--D
I also created a directory for each group on /trout/b, which seems to have almost 5 GB of space available. You can use that for data storage as well.
--D
Hi class. Here are the quiz stats and answers forwarded from Diane:
From: Diane Tang <dtang...>
Subject: Quizzes
Date: Sat, 28 Nov 1998 17:03:31 -0800
Quiz 1:
Question 1: 4, 9, 5
Question 2: <1, 1>
Average score (among the 27 people who took it): 9.4
Quiz 2:
Support(tea, coffee) = 0.01
Support(tea, jam) = 0.02
Support(coffee, jam) = 0
Confidence(tea->coffee) = 0.1
Confidence(coffee->tea) = 0.05
Confidence(tea->jam) = 0.2
Confidence(jam->tea) = 1
Confidence(coffee->jam) = - = 0
Confidence(jam->coffee) = 0
Given threshold levels, tea->jam and jam->tea hold, as well as tea,coffee and tea,jam
With implication threshold, only thing that holds is jam->tea (implication = infinity)
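As a reminder of how the confidence figures relate to the support figures, here is a small sketch in C using the standard definition confidence(A->B) = support(A, B) / support(A). The single-item supports are not given in the answers; the values tea = 0.1, coffee = 0.2 and jam = 0.02 are back-computed from the numbers above and should be read as assumptions.

```c
#include <assert.h>

/*
 * confidence(A -> B) = support(A, B) / support(A).
 * The rule is undefined when A never occurs, so that case is rejected.
 */
static double confidence(double support_ab, double support_a)
{
    assert(support_a > 0.0);
    return support_ab / support_a;
}
```

With the assumed supports, confidence(0.01, 0.1) reproduces tea->coffee, confidence(0.01, 0.2) reproduces coffee->tea, and confidence(0.02, 0.02) reproduces jam->tea.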
Pros of LSI:
-- saves space
-- handles "concepts"
Pros of Wordlist:
-- simple to implement + understand
-- easy to create indices
Average score (among the 41 people who took it): 6.9
You can pick up your quizzes from me during my office hours, or after class on Tuesday.
--D
Since I have gotten many many emails asking for this, I have added url2docid.c to the webbase.
It is a sample program that converts urls to docids. It will work on palo and alto since they have the necessary indices.
--Sergey
Here are the changes:
addition to makefile:
url2docid: url2docid.o urlhash.o debug.o
url2docid.c:
#include <stdio.h>
#include <string.h>   /* strncmp, strcspn */
#include <unistd.h>   /* chdir */
#include "search.h"

FILE *checksfp = NULL;
int numurls;

unsigned long url2docid(char *purl)
{
  unsigned int hi,lo;
  struct UrlChecksum cs;
  char *url;
  long i,j,k;   /* signed, so that j = k-1 can drop below zero and end the loop */
  int r;

  if (!checksfp) {
    checksfp = fopen(CHECKSTOIDSFN,"r");
    if (!checksfp) {
      LOG(("Couldn't open checks"));
      return 0;
    }
    LOG(("Checksfp %p", (void *)checksfp));
    fseek(checksfp, 0, SEEK_END);
    numurls = ftell(checksfp)/sizeof(struct UrlChecksum);
    LOG(("numurls %d", numurls));
  }

  /* ignore a leading http:// if present */
  if(!strncmp(purl, "http://", 7)) {
    url = purl+7;
  } else {
    url = purl;
  }

  hi = checksum(url);
  lo = checksum2(url);

  /* binary search over the records, which are sorted by (hi, lo) */
  i = 0;
  j = numurls - 1;   /* index of the last record */

  while(i<=j) {
    k = (i+j) / 2;
    /* printf_stderr("trying i=%d j=%d m=%d\n", i,j,k);*/
    r = fseek(checksfp, k*sizeof(struct UrlChecksum), SEEK_SET);
    if(r!= 0) {
      LOG(("Couldn't seek checks"));
      return 0;
    }
    r = fread(&cs, 1, sizeof(struct UrlChecksum), checksfp);
    if(r!= sizeof(struct UrlChecksum)) {
      LOG(("Couldn't read checks"));
      return 0;
    }
    /* LOG(("%u %u %u %u (%u %u %u)\n", hi, lo, cs.hi, cs.lo, i, j, k));*/

    if(cs.hi == hi) {
      if(cs.lo == lo) return cs.docid;
      else if(cs.lo < lo) i = k+1;
      else j = k-1;
    } else {
      if(cs.hi > hi) j = k-1;
      else i = k+1;
    }
  }
  return 0;   /* not found */
}

int main()
{
  char url[1024];

  chdir("data");

  while (1) {
    printf("url> ");
    fflush(stdout);
    /* fgets instead of gets: bounded read, and stop at end of input */
    if (!fgets(url, sizeof url, stdin)) break;
    url[strcspn(url, "\n")] = '\0';
    printf("docid: %lu\n",url2docid(url));
  }
  return 0;
}
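To see the lookup logic without the file I/O, here is a self-contained sketch in C of the same idea: a binary search over records sorted by (hi, lo). The struct ChecksumEntry and the function names are hypothetical stand-ins; the real struct UrlChecksum comes from search.h, and its exact layout is not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory record; the real struct UrlChecksum is in search.h. */
struct ChecksumEntry {
    unsigned int hi, lo;
    unsigned long docid;
};

/* Compare an entry with a key, ordering by hi first and then by lo. */
static int compare_key(const struct ChecksumEntry *e,
                       unsigned int hi, unsigned int lo)
{
    if (e->hi != hi) return e->hi < hi ? -1 : 1;
    if (e->lo != lo) return e->lo < lo ? -1 : 1;
    return 0;
}

/* Return the docid stored for (hi, lo), or 0 if the key is absent. */
static unsigned long lookup_docid(const struct ChecksumEntry *tab, size_t n,
                                  unsigned int hi, unsigned int lo)
{
    assert(tab != NULL || n == 0);

    size_t first = 0, last = n;        /* search the half-open range [first, last) */
    while (first < last) {
        size_t mid = first + (last - first) / 2;
        int c = compare_key(&tab[mid], hi, lo);
        if (c == 0) return tab[mid].docid;
        if (c < 0) first = mid + 1;    /* entry is too small: go right */
        else last = mid;               /* entry is too large: go left */
    }
    return 0;
}
```

Using a half-open range with size_t indices avoids the underflow that an unsigned "j = k-1" can cause when k is 0.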
I have modified repository.cc so that process no longer sleeps when it reaches the end of the repository.
--Sergey
I have created the following directories on alto you can use for disk space:
/disk/a/cs349 /disk/d/cs349 /disk/g/cs349 /disk/j/cs349 /disk/m/cs349
/disk/b/cs349 /disk/e/cs349 /disk/h/cs349 /disk/k/cs349 /disk/o/cs349
/disk/c/cs349 /disk/f/cs349 /disk/i/cs349 /disk/l/cs349 /disk/q/cs349
I recommend you create a subdirectory under one of those directories to do your work.
You can get to these from either palo or alto by using /rdisk/alto/...
Also you can get to the /disk/a dir on palo using /rdisk/palo/a.
Also note that the webbase data is local to alto so you get better performance there.
--Sergey
> Well, if I use gcc then I get the following errors because I am using C++
> iostream and fstream:
You should use gcc to compile the .c files, g++ for .cc files, and g++ to link it all.
--Sergey
>
> /home/vamp/webbase/handlers.cc:261: undefined reference to `endl(ostream&)'
> /home/vamp/webbase/handlers.cc:261: undefined reference to `cerr'
> /home/vamp/webbase/handlers.cc:261: undefined reference to `ostream::operator<<(int)'
> /home/vamp/webbase/handlers.cc:261: undefined reference to `ostream::operator<<(char const *)'
> /home/vamp/webbase/handlers.cc:261: undefined reference to `ostream::operator<<(int)'
> /home/vamp/webbase/handlers.cc:261: undefined reference to `ostream::operator<<(char const *)'
> /home/vamp/webbase/handlers.cc:261: undefined reference to `ostream::operator<<(char const *)'
> /home/vamp/webbase/handlers.cc:261: undefined reference to `ostream::operator<<(char const *)'
> /home/vamp/webbase/handlers.cc:261: undefined reference to `ostream::operator<<(int)'
> /home/vamp/webbase/handlers.cc:261: undefined reference to `ostream::operator<<(char const *)'
> /home/vamp/webbase/handlers.cc:261: undefined reference to `ostream::operator<<(ostream &(*)(ostream &))'
> /home/vamp/webbase/handlers.cc:265: undefined reference to `ofstream::ofstream(int, char const *, int, int)'
> /home/vamp/webbase/handlers.cc:266: undefined reference to `ios::operator void *(void) const'
> /home/vamp/webbase/handlers.cc:268: undefined reference to `ostream::operator<<(char const *)'
> /home/vamp/webbase/handlers.cc:269: undefined reference to `fstreambase::close(void)'
>
> -Vikash
>
[cs349] Presentations
So there's still no word on whether or not we'll have a net connection tomorrow. Someone said that they would try to get that setup this morning, and I haven't heard from him since.
So there are 3 possible technologies you can use:
-- the overhead projector
-- my laptop:
   -- powerpoint
   -- Netscape Navigator 4.x (not sure, probably 4.0)
-- your own laptop
If you're planning on using my laptop, please email me where I can download your presentation so that, just in case we don't have a net connection, you're not extremely unhappy.
Also, be sure to practice your presentation. 15 minutes is the maximum time allowed. If you go under, most likely everyone will be happy. If you go over, that is much less likely to be the case.
--D
Groups presenting tomorrow:
Tuesday, December 1
4:15: asdf
4:30: cmc
4:45: learn
5:00: scriptfinder
5:15: vamp
5:30: vls
5:45: webtek
> Dear Sergey,
>
> > Would it be possible for you to do your presentation as scheduled on Tuesday
> > but to have the project completely finished by the following Sunday?
> We would like very much to be able to do the presentation on
> Tuesday. Unfortunately, it is impossible for us to have the
> necessary statistics available by that day, and the demo will not
> be able to show the effect that we intend to have. In other words,
> we really cannot do the presentation on Tuesday.
We're working to see if we can shift around the presentation schedule. Unfortunately there are several groups in the same boat as you. We would prefer if you did your presentation on Tuesday (we will make it clear that you and several other groups were delayed by machine problems) and we will give you plenty of time to finish up the project and show us your results later.
--Sergey
Note that we have noticed that palo and alto have crashed. We are working to fix them within two hours.
If you have any suspicions about running something that might have caused them to crash please contact staff before you run it again.
--Sergey
Palo and Alto are back up after a crash unlike any I have ever seen before.
I think that future versions of this class will also be used as the ultimate stability test for hardware and software.
Lilblue seems to have magically recovered, but I don't recommend you trust it.
I have taken the opportunity to create a lilblue-compatible data directory on alto in /home/lilbluedata. This means you can use it for your data directory and repository pointers computed on lilblue will work.
I hope everyone will get their projects running again.
Note that due to all the chaos with the machines, the groups presenting tomorrow (today I should say) will not be expected to be completely finished with their projects.
Good luck, --Sergey
We have networking in the room now. Networking services had to make an emergency housecall, but all is fine now.
--D
This is a little late for those presenting on Tuesday, I know, so don't worry about it too much, but do realize this: you only have 15 minutes. One good rule of thumb is that one slide (especially if you're tired) takes about 2 minutes to present. If you speak really fast or don't have a lot on your slide, then maybe a minute and a half. So figure that you should have about 7-10 slides, max.
If you've already made your slides (and the ones that I've seen *do* look nice), you might want to go over your presentation and practice, and see what you can cut out if time runs short. I'll try to bring a watch and give every group presenting a 5 minute warning if you think that would be helpful.
With regards to the technology present, I will bring a PC laptop, with a network connection, Powerpoint, and Netscape. I will also bring a floppy drive, but I make no guarantees about it working.
If you're presenting on Thursday in Powerpoint, please make sure to let me know where I can download your slides – things will go much more smoothly if all the files are on my laptop in advance.
Thanks! --D
A bunch of people have asked so I found a way to translate docids to urls.
You need to do:
telnet palo 3491
s 1 id:<docid>
and you will get a result line back with a url, title and so forth.
Another way is to:
telnet palo 3491
r foo <docid> <docid> <docid> 0 c
And you will also get various info back. Probably the latter is better.
--Sergey
Forget Jack Frost! Come to the Tropics.
What: Larry, Sergey and the rest of the Googles invite you to the first celebration at Google, Inc.
************ Fri. Dec. 11th 6PM to 10PM *****************
Dress is Tiki Lounge wear and bring something for the hot tub. Friends are welcome.
Why:
1. Google: The Stanford Research Project is now Google.com: The Next Generation Internet Search Company. (And we've been in business 90 web years already.)
2. We have great new people in our team.
3. Our alpha at www.google.com is up and running.
4. We plan to IPO next week. (just kidding)
5. We need more people before we can IPO. Resumes welcome.
6. In appreciation of all the great people who have helped us out over the years.
7. We have a hot tub.
Where: Google's Temporary World Headquarters 232 Santa Margarita Ave. Menlo Park CA
Tel (650) 330-0100 or party.... Regrets only.
Directions from 101 Take the Willow exit south toward Menlo Park. Go through several traffic lights, and take a right at the light at Middlefield. The immediate first right is Santa Margarita (right after the restaurant). 232 is 2/3 down the block on the right. Park on the street and walk down the long driveway between two houses.
http://www.mapblast.com/mapblast/blast.hm?CMD=GEO&xx=1&id=9125909267237&AD2= 232+Santa+Margarita&AD3=Menlo+Park %2C+Ca&IC=%3A%3A8&IC%3A=Google +World+Headquarters
[ Please string this together yourself – Sam]
It will be great to see you there!
Due to the problems with the computers, all of the groups have an extension on the paper deadline until Sunday, December 6.
There was a break-in last night on palo and alto about 10:00, and trout around 3AM 10/4. If you typed any passwords in clear text in or out, they may have been compromised, and you should consider changing them.
The break-in was due to a mountd problem on linux – all the machines I checked had been broken into (and all my friends' machines too). So if you have a linux box, try more /.bash_history, and if you see any lkr type stuff, your machine has been broken into too.
The machines seem fine for now, but try to get your projects done soon!
I'll let you know more as I find it out.
-Larry
Due to the security problems, the machines will be going down sometime after midnight tonight or tomorrow morning. Let me know if that is a problem for you ASAP.
-Larry
It is possible we will be in a different room tomorrow. I am awaiting confirmation from the registrar's office. It will be larger and without the noise (I hope).
If we do change rooms, I will send out an email before the class and leave a note on the door. The new classroom will be nearby so don't worry about having to run across campus.
Tomorrow we will start to cover data mining.
Regards, --Sergey
Class will now be held in Building 370, room 370. Building 370 is in the Quad, on the side nearest Gates (sort of "behind" the Math corner if that helps). Room 370 has an entrance directly from outside, and is fairly clearly marked.
Reminder: the project proposal for cs349 is due by midnight, this Sunday (October 18). Please submit the proposal via email in plain-text.
What we expect in the project proposal:
-- what you're trying to accomplish in the project (i.e., goals)
-- why you're trying to accomplish that, why what you're doing is a gain over current technology, etc. (i.e., motivation)
-- how you plan on reaching your goals (i.e., initial approach to the problem)
-- when you plan on having what achieved (i.e., schedule, including the milestone)
-- who's going to be doing what, approximately (i.e., work breakdown)
Not necessarily in this order, format, etc. But that's what I would recommend. And of course, what you propose, we may not necessarily agree with, but then again, this is just the proposal. :-)
--D
If you know who's going to be in your group, please send me email with:
-- a group name
-- who's in the group
and I will create an account for your group. I will send you email with the group confirmation, and you can get the passwords from me in class.
--D
To access your account, please
ssh -l <groupname> trout.stanford.edu
ssh is the only way to access the machine.
I've moved all of the group home directories onto disks with space on them, so you should be able to actually save files and such now.
If you haven't picked up your passwords, please find me before class, during the break, or after class to pick up your password.
There will be a sign-up sheet in class to make appts to discuss your project proposal with Serge and Larry next week. Please make sure that your group signs up.
--D
Here is the current schedule, with approximate email addresses for each group. If you cannot make your appointment, you have the following options:
1. Email the group(s) that have the timeslot you want, and see if you can switch. If the switch occurs, email me as well.
2. Email me, tell me you can't make it, and give me some times that you could make it (not conflicting with the existing appointments), and we'll try to work something out with Serge and Larry.
Tuesday:
5:30: scriptfinder (kcsmilak... ryank)
5:45: etaoin (???)
6:00: Mark Toolis (willey, ??)
6:15: arcmakers (rambler, kaushal, sebbrion)
6:30: lkr (kirchoff, laskin, rokita)
6:45: vamp (vikash, singhala, mpdesai, pdharma)
7:00: vls (ladamic, svemuri, mehta)
7:15: sbs (kstevens, psully, reza)
Thursday:
5:30: asdf (elm, julialee, wesley)
5:45: yisun's group (????)
6:00: cmc (alanwong, tomtong, kenlaw)
6:15: learn (bogdanplayfair, paullocs, kuan, kyajima)
6:30: xyyy (xsyang, yuhualiu, yeewah, yxi)
6:45: baconbar (xliang, ms9, dmitrib, dbrussak)
7:00: tigers (onncs, ejang, leorawcs)
Wants appointment: vls (ladamic, svemuri, mehta) – can make Thursday, 2:00pm, or Tuesday/Thursday before class
Groups that I've created accounts for that I don't see on this schedule:
-- stz (takaoki kenta weizhang)
Groups that I see on the schedule that have yet to email me for an account:
-- Mark Toolis (willey, ...)
-- etaoin
-- yisun's group
--D
Hi class,
I've finally made a chunk of the webbase available for you to play with.
Here are instructions:
login to trout
> cd ~
> mkdir webbase
> cd webbase
> cp -d /usr/local/webbase/* .
# note: the -d option copies the "data" symlink that you need
> make
# everything should compile with just a couple warnings
> ./process read cat | less
# you should see lots of web pages
Send any comments/problems to cs349-staffegroups.com
Good luck. --Sergey
The first milestone of your project is due next Thursday. The way it's going to work is that it'll both be a written summary (approximately 1 page), as well as a discussion with Serge and Larry after class on Thursday.
Please email me the summary, and have a copy printed out to bring to Serge and Larry on Thursday. We need both.
Secondly, please email me both 3 preferred times AND blocks of time that you can and cannot make it. The appointments will be from 5:30 to 9:30 in 15-minute chunks in the pup lab in the basement of Gates. If everyone emails me saying that they want the 5:30, 5:45 or 6pm slots, that's not very useful, as that only lets me schedule 3 groups. Please, please send me both your preferred times AND blocks of time when you could and could not make it so that I'll have an easier time scheduling.
Thanks! --D
For meetings on Thursday (your milestone).
We're not scheduling anything at 5:30 to allow for transport time and questions after class.
If you are on the list as not being scheduled yet, please email me soon! --D
5:45: sbs (reza, kristian, peter)
6:00: xyyy (yinong, yuhua, xiaosong, yee wah)
6:15: vls (ladamic, svemuri, mehta)
6:30: learn (paul, bogdan, kuan, ken)
6:45:
7:00:
7:15: baconbar (dmitri, matt, daniel, xiaoli)
7:30:
7:45:
8:00:
8:15: vamp (vikash, singhala, mehul, pdharma)
before class (due to flight out of town)
3:45: scriptfinder (ryan, kevin)
Not scheduled yet:
-- arcmakers (rambler, kaushal, sebbrion)
-- lkr (kirchoff, laskin, rokita)
-- asdf (elm, julialee, wesley)
-- cmc (alanwong tomtong kenlaw)
-- cat5 (willey..., toolis)
-- etaoin (takaoki kenta weizhang)
-- webtek (yisuncs cchan jlin llo jchen)
-- tigers (onncs, leoraw, ejang)
5:45: sbs (reza, kristian, peter)
6:00: xyyy (yinong, yuhua, xiaosong, yee wah)
6:15: vls (ladamic, svemuri, mehta)
6:30: learn (paul, bogdan, kuan, ken)
6:45: lkr (kirchoff, laskin, rokita)
7:00: tigers (onncs, leoraw, ejang)
7:15: baconbar (dmitri, matt, daniel, xiaoli)
7:30: cat5 (willey..., toolis)
7:45: asdf (elm, julialee, wesley)
8:00: arcmakers (rambler, kaushal, sebbrion)
8:15: vamp (vikash, singhala, mehul, pdharma)
8:30: etaoin (takaoki kenta weizhang)
before class (due to flight out of town)
3:30: webtek (yisuncs cchan jlin llo jchen)
3:45: scriptfinder (ryan, kevin)
Not scheduled yet:
-- cmc (alanwong tomtong kenlaw)
I know that a lot of you have been having problems with compiling. I'm sorry that that has been happening. The first problem was that /tmp kept getting full so that there was no room to put intermediate object files. The second problem (I think) was that there wasn't enough memory available. I have since rebooted, and since rebooting, I haven't had any difficulties compiling. At some point this week, I will try to upgrade the compiler and associated libraries and other files to see if that will help as well.
In general, if you're getting an internal compiler error, that probably means that there's something external wrong (/tmp, memory, etc.). In the very near future, we'll be making additional machines available for your use.
In the mean time, try and cope as best you can, and I'll try to figure out why the compiler and machine are so intermittent. Sorry for the inconvenience!
--D
A lot of people have asked for more information about using the Webbase. I have put together some helpful examples that show how to create new handlers for process and how to create parsehandlers.
The code in /usr/local/webbase has been updated. In particular, note that there are two examples at the end of handlers.cc. You can use diff /usr/local/webbase <your directory> to see everything that has changed.
The first is called ACounter and is a subclass of DocHandler. All it does is count the number of A's in a document. To run it use: process read http countas | less
here is the code:
/* Acounter below is a very simple example DocHandler that counts the number of A's in a document. */

class ACounter: public DocHandler {
public:
  /* The function Handle below gets called for every document. */
  int Handle(Document *doc) {
    char *s;
    int ret = 0;

    if (!doc->body) {
      LOG(("No BODY found; Did you put 'http' in the process command line?"));
      return 0;
    }
    for (s = doc->body; *s; s++)
      if (*s=='a' || *s=='A') ret++;

    printf("%d:%s has %d a's\n",doc->docid,doc->url,ret);
    return 1;
  }
};
The second is called ParseTester and it is a subclass of ParseHandler. It prints out everything the parser encounters. To run it use: process read http testparser | less
Here is the code:
/* ParseTester below shows everything the parser parses out. */

class ParseTester: public ParseHandler {
  Document *doc;
public:
  /* This is an example ParseHandler. It demonstrates all the capabilities of the parser. If you create your own subclass of ParseHandler, you only need to define the functions that you need. */

  void NewDocument(Document *d) {
    doc = d;
    printf("New Document: %d: %s\n",doc->docid,doc->url);
  };
  void EndDocument() {printf("End of Document\n");};
  void AddTerm(char *word, int wordlen, int wordtype, int fontsize) {
    printf("Got Term: %s\n", word);
  };
  void AddNumber(char *word, int wordlen, int wordtype, int fontsize) {
    printf("Got Term: %s\n", word);
  };
  void AddTitle(char *title, int len) { printf("Got title: %.*s\n",len,title);};
  void AddBaseURL(char *base) { printf("Base URL: %s\n",base);};
  void NewAnchor(char *url, int len) { printf("Got Anchor: %s\n",url); };
  void AddImage(char *url) {printf("Got Image: %s\n",url); };
  void AddAnchorText(char *text, int texttype, int len) {
    printf("Got Anchor Text: %s\n",text);
  };
  void AnchorDone(char *anchor, int) { printf("Anchor Done\n");};

  /* The functions below have had very limited testing. Use at your own risk. */

  void NewForm(char *action) { printf("Got Form: %s\n",action);};
  void NewScript(char *language) { printf("Got Script: %s\n",language);};
  void NewApplet(char *code) { printf("Got Applet: %s\n",code);};
  void NewFrame(char *src) { printf("Got Frame: %s\n",src);};
  void MetaInfo(char *value, char *l) { printf("Got Meta: %s\n", value);};
};
Also note the following addition in process.cc so that these can be called:
else if (strcmp(argv[i],"testparser")==0) {
  handlers[nhandlers++] = new HTMLParser(new ParseTester());
  if (!httpparser) LOG(("WARNING: No HTTP Parser before ParseTester"));
}
else if (strcmp(argv[i],"countas")==0) {
  handlers[nhandlers++] = new ACounter();
  if (!httpparser) LOG(("WARNING: No HTTP Parser before ACounter"));
}
We'll let everyone run over the main webbase soon. That will have functionality like url2docid.
Regards, --Sergey
PS Also please note the COPYRIGHT file in the source directory.
I have just made another machine available to those who need more horsepower. It is an IBM with 4 processors and 512M of memory.
It is called lilblue and you should be able to log in with the same account and pw you use on trout.
You will need to update your code. Namely, Makefile, bigfile.c, and repository.cc have been updated. After that:
ln -s /home/webbase/ibmdata ./data
rm *.o   # BE CAREFUL TYPING THIS
make process   # expect a lot of warnings
./process read http testparser
Note: the IBM development environment is not nearly as nice as Linux but you have the entire repository available on lilblue and you have a lot of memory and CPU.
Also, if you don't update repository.cc, then you will see a lot of dots when you run process.
More to come.
Regards, --Sergey
We have made some new link index files available on liblue. To see how to use them look at the new demo program readalink.c
--Sergey
Some students have pointed out a typo in a previous message. The links indexes are on lilblue (note the second l).
If you use lilblue, don't forget to set your data directory to the one pointed to by ibmdata.
Also as a number of people have discovered, the webbase code changes directory to the data directory. So relative paths in your code will be relative to the data dir.
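One way to cope with the directory change is to turn any relative path into an absolute one before the working directory moves. The sketch below is an illustration only, not part of the webbase code; the function name make_absolute is hypothetical, and it relies on the POSIX getcwd call.

```c
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Write an absolute version of `path` into `out`. Call this before the
 * working directory changes. Returns 0 on success, -1 if the current
 * directory cannot be read or the result does not fit in `out`.
 */
static int make_absolute(const char *path, char *out, size_t outlen)
{
    assert(path != NULL && out != NULL);

    char cwd[4096];
    int n;
    if (path[0] == '/')
        n = snprintf(out, outlen, "%s", path);            /* already absolute */
    else if (getcwd(cwd, sizeof cwd) != NULL)
        n = snprintf(out, outlen, "%s/%s", cwd, path);    /* prefix current dir */
    else
        return -1;
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

An output file name given on the command line could be passed through make_absolute at startup, so that it still refers to the directory you launched from once the code has switched to the data dir.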
-------------
We are providing an interface to the backend search server for Google since a number of groups have requested the functionality found there.
You can use: telnet palo.stanford.edu 3491
When connected, the following commands work.
d 10 search terms
# will return docids of docs matching the search terms
# output lines = docid pagerank relevance phrase/nophrase
s 10 search terms
# will produce web like search results but the formatting is tricky.
You can also use:
d 10 link:doc
s 10 link:doc
d 10 flink:doc
s 10 flink:doc
p doc # print the contents of a document
In the above a doc can be a url like www-db.stanford.edu/ (note the trailing /) or a docid.
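If you would rather talk to the server from a program than from telnet, the command lines can be built like this. This is a sketch under assumptions: the function name format_query is hypothetical, and it assumes each command is a single line ended by a newline, which the description above does not state. Opening the TCP connection to palo.stanford.edu port 3491 and writing the line to it is left out.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Build one command line such as "d 10 search terms\n".
 * `cmd` is 'd' (docids) or 's' (web-like results); `count` is the number
 * that follows the command letter. Returns 0 on success, -1 if `out` is
 * too small.
 */
static int format_query(char *out, size_t outlen, char cmd, int count,
                        const char *terms)
{
    assert(out != NULL && terms != NULL);
    assert(cmd == 'd' || cmd == 's');

    int n = snprintf(out, outlen, "%c %d %s\n", cmd, count, terms);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

For a backlink query, pass the doc with its prefix as the terms, for example "link:www-db.stanford.edu/".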
If the search server crashes, please send email to cs349-staffegroups.com. Also, please don't tax it too heavily.
Cheers, --Sergey
I do not have milestone reports from the following groups: cmc lkr scriptfinder vls
I know some of you were asked to do some additional work before turning in the milestone but it's been a week now. Please email me your report.
Thanks! --Diane
Some of you have been asking what is expected of you at the end of the quarter. First, there is no final. Second, we do in fact expect you to prepare a presentation and a paper.
The paper should be in the format of a publishable research paper. For example:
Abstract
Introduction (Background, Motivation, Goals)
Proposed Solution
Results/Discussion (what worked, what didn't work, how well things worked, etc.)
Conclusion/Discussion/Future Work
optional: Related Work
The paper should preferably be either in Postscript format, or in HTML, and will be due Friday, December 4, 1998 at midnight, via email to me. If it's in HTML, email me the HTML and the URL, rather than just the URL.
The presentations will span class on December 1 and December 3. I'll arrange for a room with a computer with a net connection and project equipment. Hopefully, it'll be a PC that supports PowerPoint, but more details on that will come.
What I would like to know is if people would be willing to have class extend from 4pm to 6pm on Dec. 1 and Dec. 3 so that 15 minute presentations will fit in (You'll have a much easier time trying to present your project in 15 minutes rather than something more compressed, trust me). If you can, then I'll start setting up the schedule. Otherwise, we'll have to try to work for some alternate solution.
--Diane
---------------
---------------
Yes, yes,I know it's down – it crashed in the midst of me trying to upgrade the compiler. As soon as I find someone with a key to that office, I'll reboot it.
--Diane
is installed on trout. Should work, but you may have to tweak the libraries in the makefile.
http://graphics/infrastructure/howto.html
In the interest of fairness rather than first come first serve, I created a schedule for group presentations in the following way: each group was randomly assigned a 1 or a 3 (1 for presenting on Dec 1, 3 for presenting on Dec 3), and then the groups on a particular day were arranged in alphabetical order.
If you cannot make the slot assigned to you (and this will only be applicable if you're presenting 5:30 or later), then I will swap your group with another group on the same day.
So here's the schedule:
Tuesday, December 1
4:30: asdf
4:45: cmc
5:00: learn
5:15: sbs
5:30: scriptfinder
5:45: vamp
6:00: vls
6:15: webtek
Thursday, December 3
4:30: arcmakers
4:45: baconbar
5:00: cat5
5:15: etaoin
5:30: lkr
5:45: tigers
6:00: xyyy
Shift everyone up by 15 minutes – for some reason, I thought class began at 4:30 rather than at 4:15.
--D
[ Edited two lines to prevent the page formatting breaking – Sam]