A Google Custom Search Engine searching the full text of research published in Australian university repositories:
What does this do? How does it work?
Most Australian universities manage repositories of their publications. Google may crawl those repositories and index the text. The Google Custom Search Engine lets anyone scope a search over just a specific part of the entire world wide web. This tool uses the Google Custom Search Engine to limit a search to just some Australian university repositories, more or less.
How does it know where to look?
I've given it a list of URL patterns (see below). This list is almost certainly incomplete, so if you know how I can improve it, email me at kent.fitch@projectcomputing.com. The original list was seeded from links found here, and then extended by running Google searches such as
site:xxx.edu.au filetype:pdf
and looking for interesting patterns.
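As a rough sketch of that seeding step, the query strings could be generated rather than typed by hand. The host names below are placeholders invented for illustration, not the hosts actually probed, and the helper name `seed_query` is hypothetical.

```python
# Placeholder hosts for illustration only; substitute the hosts you want to probe.
hosts = ["eprints.example.edu.au", "dspace.example.edu.au"]


def seed_query(host: str, filetype: str = "pdf") -> str:
    """Build a Google query restricted to one host and one file type."""
    return f"site:{host} filetype:{filetype}"


# One query per host, ready to paste into the Google search box.
queries = [seed_query(h) for h in hosts]
for q in queries:
    print(q)
```

The interesting part still has to be done by eye: reading the result URLs and spotting the path prefix the repository software uses.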
How is this related to ARROW?
It isn't, really. ARROW is a program aimed at helping Australian universities establish repositories. It also regularly collects the metadata describing the contents of those repositories (not the full text) and "pushes" what it finds to Google, which reads and indexes those contents and other pages it discovers as a result of that processing.
The ARROW Discovery service lets you search the collected metadata. Its results are relevance ranked based on the occurrence of your search term in the collected metadata. But it doesn't "harvest" or index the full text. So unless your search term appears in the metadata of a resource, it won't find it.
This tool, however, searches the full text of resources Google has crawled and their link metadata (the text contents of hyperlinks on the web which point to these resources). Relevance ranking is based on Google-magic, which includes occurrence of your search term in the full text contents and incoming links as well as the number of incoming links and their PageRank. For result sets comprised of resources with few incoming links, ranking is rather poor.
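A toy illustration of why the two approaches return different result sets. The records and function names here are made up for the example, and it models only matching, not relevance ranking or link text.

```python
# Made-up records for illustration; no real repository content is shown.
records = [
    {
        "title": "Coastal erosion in Tasmania",
        "abstract": "A survey of shoreline change.",
        "full_text": "Coastal erosion ... sediment transport models show ...",
    },
    {
        "title": "Sediment budgets of inland rivers",
        "abstract": "Estimates of sediment load.",
        "full_text": "Sediment budgets ... gauging station data ...",
    },
]


def metadata_search(term: str) -> list[str]:
    """Match the term against title and abstract only."""
    term = term.lower()
    return [
        r["title"]
        for r in records
        if term in (r["title"] + " " + r["abstract"]).lower()
    ]


def full_text_search(term: str) -> list[str]:
    """Match the term against the full text of each record."""
    term = term.lower()
    return [r["title"] for r in records if term in r["full_text"].lower()]


print(metadata_search("sediment"))
print(full_text_search("sediment"))
```

The first record mentions "sediment" only in its body, so the metadata search misses it while the full-text search finds it.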
The two approaches will find slightly different sets of resources and present them in a different order. For example, compare:
Can the two approaches be combined?
Of course! If searching on a subset of research outputs from the university sector by national boundary is important, I guess they will be!
Why do people want to limit searches to Australian university outputs placed in specific Australian university repositories?
Apart from bean-counters, probably they don't. It is very important that research "outputs" are made public and can be discovered - this is an important goal of ARROW. But whether it is useful to restrict discovery to some fraction of materials produced within a national boundary, reduced further to those materials uploaded to an Australian university repository allied with the ARROW service, then discovered and indexed by Google, is another question...
URL patterns searched by this tool
Note: Google does not (yet) index the contents of all these URLs. Some may be excluded from Google's view by a robots.txt configuration. Others may not be crawlable, or may not be linked from the outside web, a gap ARROW is hoping to fill.
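To see whether a robots.txt rule would hide a repository path from a crawler, Python's standard `urllib.robotparser` can evaluate the rules. The robots.txt content and the host `repository.example.edu.au` below are invented for illustration; a real check would read the site's actual file.

```python
from urllib.robotparser import RobotFileParser

# An invented robots.txt, parsed from a string so no network access is needed.
robots_txt = """\
User-agent: *
Disallow: /vital/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())


def crawlable(url: str, agent: str = "Googlebot") -> bool:
    """Return True if the parsed rules allow this agent to fetch the URL."""
    return parser.can_fetch(agent, url)


print(crawlable("http://repository.example.edu.au/vital/access/manager/Repository/x"))
print(crawlable("http://repository.example.edu.au/about.html"))
```

This covers only the robots.txt case. A page that is permitted but never linked from anywhere will pass this check and still go unindexed.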
ACT
www.library.unsw.edu.au/~thesis/adt-ADFA/uploads/*
erl.canberra.edu.au/*
thesis.anu.edu.au/uploads/
dspace.anu.edu.au/bitstream/*
dspace.anu.edu.au/html/*
dlibrary.acu.edu.au/digitaltheses/*

NSW
epubs.scu.edu.au/cgi/*
library.uws.edu.au/adt-NUWS/uploads/*
arrow.uws.edu.au:8080/vital/access/manager/Repository/*
*.une.edu.au/*article*.pdf
*.une.edu.au/*publications*.pdf
*.une.edu.au/*Preprints*.pdf
*.une.edu.au/*Report*.pdf
www.researchonline.mq.edu.au:9080/vital/access/manager/Repository/*
www.library.uow.edu.au/adt-NWU/uploads/*
ro.uow.edu.au/cgi*
www.library.unsw.edu.au/~thesis/adt-NUN/uploads/*
unsworks.unsw.edu.au/vital/access/manager/Repository/*
unsworks.unsw.edu.au/vital/access/services/Download/*
epress.lib.uts.edu.au/dspace/html/*
epress.lib.uts.edu.au/dspace/bitstream/*
ses.library.usyd.edu.au/bitstream/*
www.newcastle.edu.au/services/library/adt/uploads/*
ogma.newcastle.edu.au:8080/vital/access/manager/Repository*
csu.edu.au/research/*.pdf

QLD
eprints.usq.edu.au/*.pdf
eprints.usq.edu.au/*
adt.library.qut.edu.au/adt-qut/uploads/*
eprints.qut.edu.au/archive/*
adt.library.uq.edu.au/public/*
eprint.uq.edu.au/archive/*
eprints.jcu.edu.au/*.pdf
eprints.jcu.edu.au/*
www4.gu.edu.au:8080/adt-root/uploads/*
www98.griffith.edu.au/dspace/html/*
www98.griffith.edu.au/dspace/bitstream/*
research.usc.edu.au/vital/access/manager/Repository/*
library-resources.cqu.edu.au/thesis/*
acquire.cqu.edu.au:8080/vital/access/manager/Repository/*
epublications.bond.edu.au/context/*
epublications.bond.edu.au/cgi/*

VIC
eprints.infodiv.unimelb.edu.au/archive/*
digthesis.ballarat.edu.au/adt/uploads*
wallaby.vu.edu.au/adt-VVUT/uploads/*
eprints.vu.edu.au/archive/*
adt.lib.swin.edu.au/uploads/*
researchbank.swinburne.edu.au/vital/access/manager/Repository/*
researchbank.swinburne.edu.au/vital/access/services/*
adt.lib.rmit.edu.au/adt/uploads/*
arrowprod.lib.monash.edu.au/vital/access/services/Download/*
eprint.monash.edu.au/*
alpha3.latrobe.edu.au/thesis/uploads/*
tux.lib.deakin.edu.au/adt-VDU/*

WA
espace.lis.curtin.edu.au/archive/*
espace.lis.curtin.edu.au/archive/*.pdf
adt.curtin.edu.au/theses/available/*
ro.ecu.edu.au/rqf_submissionsfedrt/*
portal.ecu.edu.au/adt-public*
wwwlib.murdoch.edu.au/adt/pubfiles/*
*.uwa.edu.au/*article*.pdf
theses.library.uwa.edu.au/adt-*

SA
digital.library.adelaide.edu.au/dspace/bitstream/*
digital.library.adelaide.edu.au/dspace/html/*
dspace.flinders.edu.au/dspace/bitstream/*
dspace.flinders.edu.au/dspace/html/*
catalogue.flinders.edu.au/local/adt/uploads/*

TAS
eprints.utas.edu.au/*
eprints.utas.edu.au/*.pdf
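
A minimal sketch of how patterns like these can be tested against a URL. It assumes `*` behaves as a shell-style wildcard that matches any run of characters, including `/`; Google's own matching rules may differ in detail. Only three of the patterns are included, and the helper names are hypothetical.

```python
from fnmatch import fnmatchcase

# A small sample of the patterns listed above.
patterns = [
    "eprints.utas.edu.au/*",
    "*.une.edu.au/*article*.pdf",
    "dspace.anu.edu.au/bitstream/*",
]


def strip_scheme(url: str) -> str:
    """Drop a leading 'http://' or 'https://' so the URL lines up with the patterns."""
    return url.split("://", 1)[-1]


def in_scope(url: str) -> bool:
    """Return True if the URL matches at least one pattern (case-sensitive)."""
    target = strip_scheme(url)
    return any(fnmatchcase(target, p) for p in patterns)


print(in_scope("http://eprints.utas.edu.au/123/1/thesis.pdf"))
print(in_scope("http://www.une.edu.au/staff/article-2007.pdf"))
print(in_scope("http://www.example.com/paper.pdf"))
```

A check like this is handy when proposing a new pattern: run it over a handful of known document URLs from the repository and confirm they all fall in scope.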
Thanks to the following people for updates:
Kent Fitch, Project Computing
Project Computing Pty Ltd ACN: 008 590 967 | contact@projectComputing.com