code_share_flickr_2008_08_18
Category: Other
Development tool: Matlab
File size: 3140KB
Downloads: 40
Upload date: 2009-06-04 00:28:53
Uploader: notimetokill
Description: A program for batch-downloading images from Flickr using Flickr's Python and Matlab interfaces; very helpful for projects that need large numbers of pictures.
File list:
download_imgs\rescale_max_size.m (1195, 2007-10-13)
download_imgs\trim_border.m (341, 2007-08-21)
download_imgs\temp (0, 2007-11-17)
download_imgs\cvlib_mex.mexglx (100282, 2007-04-29)
download_imgs\downloadphotos_gps.m (12801, 2008-04-29)
download_imgs\fast_resize.m (913, 2008-04-29)
download_imgs\libcv.so.1 (3980937, 2007-04-26)
download_imgs\libcxcore.so.1 (3238114, 2007-04-26)
download_imgs\my_rgb2gray.m (302, 2008-04-29)
download_imgs\notes.txt (197, 2007-09-25)
download_imgs\remove_frame.m (2514, 2007-10-13)
download_imgs (0, 2008-04-29)
query_imgs\get_imgs_dyn_timeskip.py (11928, 2008-04-29)
query_imgs\place_rec_queries.txt (6643, 2007-09-26)
query_imgs\flickrapi2.py (16781, 2008-08-18)
query_imgs (0, 2008-08-18)
This code was originally written by Tamara Berg and later extended by
James Hays (jhhays@cs.cmu.edu).
Updated 8/18/2008: previous archives included flickrapi.py instead of
flickrapi2.py, which handles missing XML fields.
The code, as is, was used to generate the geotagged database for
IM2GPS: estimating geographic information from a single image.
James Hays and Alexei A. Efros. CVPR 2008.
http://graphics.cs.cmu.edu/projects/im2gps/
The code operates in two distinct stages:
first querying for images, then downloading them.
--------------------
1) Querying
--------------------
The image query code is written in Python, using the Python Flickr
API interface; its acknowledgements and credits appear at the top
of flickrapi2.py. To use the Flickr API you need an API key (see
http://www.flickr.com/services/api/), which you must enter in
get_imgs_dyn_timeskip.py (line 31), in addition to changing the
output path (line ***).
The query script, get_imgs_dyn_timeskip.py, searches for Flickr
images with keywords listed in place_rec_queries.txt.
place_rec_queries.txt also contains negative text constraints, at
the bottom of the file. You can change these to whatever keywords
you want. For each keyword, get_imgs_dyn_timeskip.py will produce
a text file containing information about the images found.
Keywords can be either tags or, as currently specified, any text
associated with the image including title and description.
Querying FAQ
a) I don't care about geotagged images; I want them all.
As is, get_imgs_dyn_timeskip.py will only retrieve geotagged
images. To retrieve all images, delete the
bbox="-180, -90, 180, 90",
accuracy="6"
constraints from the Flickr search API calls (lines 114 and 175).
b) Why does get_imgs_dyn_timeskip.py look so complicated / run so
slowly?
The Flickr API will disable your key if you query too rapidly, so
it makes sense to issue large queries that return hundreds of
results. But big queries are problematic, because there seems to
be a long-standing, well-known bug in the Flickr search function:
for any given search, duplicates start to appear after roughly the
1500th image. You can work around this with time-bounded queries,
but then you run into the problem of making too many small
queries.
Therefore get_imgs_dyn_timeskip.py issues queries over dynamically
sized time intervals, aiming for about 400 results per query. If
few images are found because you've made a rare query, the time
interval tends to expand. If too many images are found, the
interval shrinks. As the time window moves toward the present day
it tends to get narrower, because the rate at which people upload
pictures to Flickr appears to be increasing.
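The window adjustment described above can be sketched roughly as follows. The target of ~400 results comes from the text; the doubling rule, the damping factor, and the function name are illustrative assumptions, not the actual constants in get_imgs_dyn_timeskip.py.

```python
TARGET_RESULTS = 400  # aim for roughly this many results per query (from the text)

def next_interval(current_interval_days, results_found):
    """Grow the time window when a query is sparse, shrink it when it
    overflows, keeping result counts near TARGET_RESULTS."""
    if results_found == 0:
        return current_interval_days * 2          # rare query: expand fast
    # scale the window proportionally toward the target count
    scaled = current_interval_days * TARGET_RESULTS / results_found
    # damp the adjustment so the window doesn't oscillate wildly
    return 0.5 * current_interval_days + 0.5 * scaled
```

For instance, a 10-day window that returned 800 results would shrink, while one that returned nothing would double.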
c) Will I get duplicate images?
No, actually. Let's say your keyword file is
# first line of file
mountain
lake
-city
# end of file
First the script will find all of the images that have the text
"mountain" associated with them, but not "city". Then for the
next keyword "mountain" will be used as a negative constraint.
So the queries will actually be "mountain -city" then
"lake -city -mountain", etc. Images that have both "mountain" and
"lake" will be found with the first query and not the second.
d) Can I run the script in parallel to speed things up?
It's possible, but it makes it more likely that your API key will
be disabled, and you'll have to take steps to ensure that you're
not getting duplicate images, for instance by manually adding the
keywords that one thread is using as negative constraints to the
other threads.
e) Couldn't I just check for duplicates after querying, before
downloading?
Sure, although I haven't written code to do that. Flickr images
have unique serial numbers so it's possible.
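A post-hoc duplicate check could look something like the sketch below. It is hypothetical: it assumes each record line in the query output starts with the photo id, which you would want to verify against the actual file format before relying on it.

```python
def dedup_query_files(lines):
    """Drop records whose Flickr photo id has already been seen,
    preserving the first occurrence and the original order.
    Assumes (hypothetically) that each line begins with the id."""
    seen, unique = set(), []
    for line in lines:
        photo_id = line.split()[0]
        if photo_id not in seen:
            seen.add(photo_id)
            unique.append(line)
    return unique
```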
--------------------
2) Downloading
--------------------
The image download code is written in Matlab. It accesses images
on Flickr's http server instead of going through the API, and thus
doesn't require an API key. It reads the text files produced by
get_imgs_dyn_timeskip.py, downloads the photos, and saves all of
the image attributes (tags, interestingness, longitude/latitude,
etc.) as a Matlab cell string array in the comment field of each
.jpg. Use imfinfo() to read them later.
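If you would rather read the saved attributes outside Matlab, the JPEG comment (COM) segments can be extracted with a short stdlib-only marker walk. This is a sketch: it simply returns every COM segment as text and makes no assumptions about how the Matlab script formats the cell array inside it.

```python
import struct

def jpeg_comments(path):
    """Extract COM (0xFFFE) segment payloads from a JPEG file by
    walking its marker segments until the image data starts."""
    comments = []
    with open(path, 'rb') as f:
        data = f.read()
    i = 2  # skip the SOI marker (0xFFD8)
    while i + 4 <= len(data):
        if data[i] != 0xFF:
            break  # not a marker; bail out of the sketch
        marker = data[i + 1]
        if 0xD0 <= marker <= 0xD9:
            i += 2  # standalone marker (SOI/EOI/RSTn) has no length field
            continue
        (length,) = struct.unpack('>H', data[i + 2:i + 4])
        if marker == 0xFE:  # COM segment: payload follows the 2-byte length
            comments.append(data[i + 4:i + 2 + length].decode('latin-1'))
        if marker == 0xDA:  # start of scan: no more metadata segments
            break
        i += 2 + length
    return comments
```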
The main function is downloadphotos_gps.m. You'll need to set the
paths at the top. If you've changed any of the fields saved by
the query script, you'll need to edit the download script as well,
since it expects certain fields in a certain order (see the source
code for an example).
Downloading FAQ
a) What size images will this get?
Currently the code tries to find the Flickr "Large" size photo,
which has a max width or height of 1024. Failing that, it tries
to get the "Original" size photo. If the "Original" is larger
than 1024 in height or width, it is downsampled to 1024. If it is
smaller than 500 in height or width, it is thrown away. Otherwise
the image is kept.
A significant fraction of images are too small by this criterion
and are thus thrown away.
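The size rules above amount to a small decision function. This is a sketch of one reading of those rules; the actual thresholds and edge cases live in downloadphotos_gps.m, and in particular whether "smaller than 500" tests the larger or the smaller dimension is an assumption here.

```python
MAX_DIM = 1024  # Flickr "Large" size cap (from the text)
MIN_DIM = 500   # reject anything whose largest dimension is below this

def size_decision(width, height):
    """Return 'discard', 'downsample', or 'keep' per the rules above
    (assuming the 500px test applies to the larger dimension)."""
    if max(width, height) < MIN_DIM:
        return 'discard'      # too small to be useful
    if max(width, height) > MAX_DIM:
        return 'downsample'   # shrink an oversized "Original" to 1024
    return 'keep'
```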
b) How are the images written to disk?
Since most file systems have trouble with thousands of files in a
directory, the images are put into a hierarchy of directories that
contain no more than 1000 images each. The hierarchy is
base_db_path / keyword / numbered subdir / img_name
for example
Flickr_geo_and_gps/Argentina/00015/315157387_c36ba74681_100_23812473@N00.jpg
The image filenames contain the photo id, secret, server id, and
owner which can be used to trace the .jpg back to its source on
Flickr. See the source code for examples of how the URLs are
constructed.
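The layout and filename scheme can be reproduced like so. The five-digit subdirectory numbering and the 1000-per-directory bucketing are inferred from the example path above and should be checked against downloadphotos_gps.m before use.

```python
import os

def image_path(base, keyword, index, photo_id, secret, server, owner):
    """Build base / keyword / numbered-subdir / filename, with at most
    1000 images per numbered subdirectory (subdir = index // 1000,
    zero-padded to five digits, as in the example path)."""
    subdir = '%05d' % (index // 1000)
    # filename encodes photo id, secret, server id, and owner so the
    # .jpg can be traced back to its source on Flickr
    fname = '%s_%s_%s_%s.jpg' % (photo_id, secret, server, owner)
    return os.path.join(base, keyword, subdir, fname)
```

For image number 15123 under the "Argentina" keyword, this reproduces the example path shown above.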
c) What about all those annoying artsy borders that people put
around their Flickr photos?
The download script tries to remove them; see remove_frame.m.
d) Can I run the download script in parallel?
Yes, I've run 15 copies in parallel in the past. I wouldn't
recommend doing any more than this because Flickr could get mad at
us. They're aware that researchers are using Flickr as a data
source but their main concern is that we don't impact the quality
of service for the millions of people who use Flickr.
To run multiple scripts in parallel you'll need to split up the
text files from the query process manually, then change the path
in downloadphotos_gps.m for each call.
e) What are these libraries and mex files?
Matlab's image resizing is slow, so I call the OpenCV image resize
through a mex wrapper. If that's not working, you can simply
change the calls to fast_resize into calls to Matlab's slower
imresize().
f) What about copyrights?
It is worth noting that Flickr allows photographers to specify
Creative Commons licenses for their images instead of the default
"all rights reserved". This script saves the license info with
each .jpg file, so you can pick out Creative Commons images after
the fact (in my experience, fewer than 10% of images). It is
also possible to restrict the search to images with certain
licenses at query time. See the Flickr API for details.