{"id":2240,"date":"2020-10-22T12:47:31","date_gmt":"2020-10-22T12:47:31","guid":{"rendered":"https:\/\/archives-library.wcsu.edu\/cao\/?page_id=2240"},"modified":"2024-08-05T19:07:01","modified_gmt":"2024-08-05T19:07:01","slug":"direct-harvest-from-an-archivesspace-instance","status":"publish","type":"page","link":"https:\/\/archives-library.wcsu.edu\/cao\/direct-harvest-from-an-archivesspace-instance\/","title":{"rendered":"Direct harvest from an ArchivesSpace instance"},"content":{"rendered":"<div class=\"entry\">\n\n\n<p class=\"wp-block-paragraph\">There are two ways we can harvest EADs from an ArchivesSpace instance: we can pull them from your server, or you can push them to our server.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Option 1: We pull<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The CAO runs a script on our server that connects to your ArchivesSpace instance and downloads your EADs to our server. We need two things to accomplish this:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Login credentials to your system.<\/li>\n\n\n\n<li>Outside traffic allowed into your ArchivesSpace backend (for example, https:\/\/yourASpace.org:8089).<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">To get started:<\/p>\n\n\n\n<ol class=\"wp-block-list\">\n<li>Let us know you&#8217;re interested, and fill out the <a href=\"https:\/\/archives-library.wcsu.edu\/cao\/participation\/application\/\" data-type=\"page\" data-id=\"1961\">application to participate<\/a>.<\/li>\n\n\n\n<li>Send us your ASpace backend URL. By default, it&#8217;s the URL of your staff interface with the port changed to 8089.<\/li>\n\n\n\n<li>Create a basic user for us on your ArchivesSpace instance and send us the credentials.<\/li>\n\n\n\n<li>Let our machine past your firewall (we&#8217;ll give you our IP address). This guide may help on Linux machines: 
<a href=\"https:\/\/www.tecmint.com\/open-port-for-specific-ip-address-in-firewalld\/\">https:\/\/www.tecmint.com\/open-port-for-specific-ip-address-in-firewalld\/<\/a><\/li>\n<\/ol>\n\n\n\n<h3 class=\"wp-block-heading\">What will happen?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Every day we scan your ASpace instance for resource records that have changed. For each changed record, your system exports an EAD and we add the file to your data directory on CAO. We use ArchivesSnake to accomplish this. More information on ArchivesSnake is here: <a href=\"https:\/\/github.com\/archivesspace-labs\/ArchivesSnake\">https:\/\/github.com\/archivesspace-labs\/ArchivesSnake<\/a><\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Why bother?<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">It&#8217;s a &#8220;set it and forget it&#8221; option.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Good to know<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The export script requires some fields for a successful export. Please make sure your finding aids have:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>title<\/li>\n\n\n\n<li>abstract<\/li>\n\n\n\n<li>agent linked with the role &#8220;creator&#8221;<\/li>\n\n\n\n<li>biographical\/historical note<\/li>\n\n\n\n<li>scope\/content note<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">More details can be found on the page <a href=\"https:\/\/archives-library.wcsu.edu\/cao\/new-coding-conventions\/\" data-type=\"page\" data-id=\"2218\">Coding Conventions<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Option 2: You push<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">If you can&#8217;t or don&#8217;t want to give us login credentials to your ArchivesSpace instance, you can instead export the EADs on your server and upload them to our server using WebDAV.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">One way to do this is to use the ArchivesSnake and webdavclient3 Python modules. 
<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Installation instructions for <a href=\"https:\/\/github.com\/archivesspace-labs\/ArchivesSnake\">ArchivesSnake<\/a> can be found here: <a href=\"https:\/\/github.com\/archivesspace-labs\/ArchivesSnake\">https:\/\/github.com\/archivesspace-labs\/ArchivesSnake<\/a>. Installation is straightforward.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You will then need to run a Python script on your server to accomplish the &#8220;push.&#8221; Below is a sample script you can use (<a href=\"https:\/\/github.com\/UAlbanyArchives\/ArchivesSpace-ArcLight-Workflow\/blob\/master\/exportPublicData.py\">based on the University at Albany&#8217;s script<\/a>). Replace fakeUsername, mainAgencyCode, yourMainAgencyCode, and both fakePassword values with your own, and adjust the repository ID (3 in the sample) and output_path to match your system.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Sample Script<\/h3>\n\n\n\n<pre class=\"wp-block-code\"><code># -*- coding: utf-8 -*-\nimport os\nimport sys\n#helper module providing iso2DACS(); it is not installed with ArchivesSnake,\n#so dacs.py must be importable (for example, placed next to this script)\nimport dacs\nimport time\nimport csv\nfrom datetime import datetime\nimport asnake.logging as logging\nfrom asnake.client import ASnakeClient\n\n#setting up connection to WCSU webdav\nfrom webdav3.client import Client\noptions = {\n 'webdav_hostname': \"http:\/\/archives.library.wcsu.edu\/webdav\/arclight\/data\/ead\/mainAgencyCode\/\",\n 'webdav_login':    \"yourMainAgencyCode\",\n 'webdav_password': \"fakePassword\"\n}\nwebDavClient = Client(options)\nwebDavClient.verify = False # To not check SSL certificates (Default = True)\n\n\nprint (str(datetime.now()) + \" Exporting Records from ArchivesSpace\")\n\nprint (\"\\tConnecting to ArchivesSpace\")\n\nclient = ASnakeClient(baseurl=\"http:\/\/localhost:8089\",\n                      username=\"fakeUsername\",\n                      password=\"fakePassword\")\nclient.authorize()\nlogging.setup_logging(stream=sys.stdout, level='INFO')\n\n#repo = ASpace().repositories(3)\n\n__location__ = 
os.path.dirname(os.path.realpath(__file__))\n\nlastExportTime = time.time()\ntry:\n    timePath = os.path.join(__location__, \"lastExport.txt\")\n    with open(timePath, 'r') as timeFile:\n        startTime = int(timeFile.read().replace('\\n', ''))\n        timeFile.close()\nexcept:\n    startTime = 1592193600\n\nhumanTime = datetime.utcfromtimestamp(startTime).strftime('%Y-%m-%d %H:%M:%S')\nprint (\"\\tChecking for collections updated since \" + humanTime)\n    \noutput_path = \"\/home\/haponiks\/archivesSnakeExport\/eads\"\nstaticData = os.path.join(output_path, \"staticData\")\n\n#read existing exported collection data\ncollectionData = &#91;]\n#collectionFile = open(os.path.join(staticData, \"collections.csv\"), \"r\", encoding='utf-8')\n#for line in csv.reader(collectionFile, delimiter=\"|\"):\n#    collectionData.append(line)\n#collectionFile.close()\n\n#read existing exported subject data\nsubjectData = &#91;]\n#subjectFile = open(os.path.join(staticData, \"subjects.csv\"), \"r\", encoding='utf-8')\n#for line in csv.reader(subjectFile, delimiter=\"|\"):\n#    subjectData.append(line)\n#subjectFile.close\n\nprint (\"\\tQuerying ArchivesSpace...\")\nmodifiedList = client.get(\"repositories\/3\/resources?all_ids=true&amp;modified_since=\" + str(startTime)).json()\nif len(modifiedList) &gt; 0:\n    print (\"\\tFound \" + str(len(modifiedList)) + \" new records!\")\n    print (\"\\tArchivesSpace URIs: \" + str(modifiedList))\nelse:\n    print (\"\\tFound no new records.\")\nfor colID in modifiedList:\n    collection = client.get(\"repositories\/3\/resources\/\" + str(colID)).json()\n    if collection&#91;\"publish\"] != True: \n        print (\"\\t\\tSkipping \" + collection&#91;\"title\"] + \" because it is unpublished\")\n    else:\n        print (\"\\t\\tExporting \" + collection&#91;\"title\"] + \" \" + \"(\" + collection&#91;\"id_0\"] + \")\")\n    \n        try:\n            normalName = collection&#91;\"finding_aid_title\"]\n        except:\n            
print (\"\\t\\tError: incorrect Finding Aid Title (sort title)\")\n            #fall back to the resource title so the export can continue\n            normalName = collection&#91;\"title\"]\n        \n        #DACS notes\/fields to check before exporting (\"acqinfo\" is not required)\n        dacsNotes = &#91;\"ead_id\", \"abstract\", \"bioghist\", \"scopecontent\", \"arrangement\", \"creator\"]\n        checkDACS = {}\n        for dacsNote in dacsNotes:\n            checkDACS&#91;dacsNote] = False\n        checkAccessRestrict = False\n        abstract = \"\"\n        accessRestrict = \"\"\n        \n        if \"ead_id\" in collection.keys():\n            checkDACS&#91;\"ead_id\"] = True\n            \n        for note in collection&#91;\"notes\"]:\n            if \"type\" in note.keys():\n                if note&#91;\"type\"] == \"abstract\":\n                    checkDACS&#91;\"abstract\"] = True\n                    abstract = note&#91;\"content\"]&#91;0].replace(\"\\n\", \"&amp;#13;&amp;#10;\")\n                if note&#91;\"type\"] == \"accessrestrict\":\n                    checkAccessRestrict = True\n                    for subnote in note&#91;\"subnotes\"]:\n                        #append each subnote instead of overwriting the previous one\n                        accessRestrict = accessRestrict + \"&amp;#13;&amp;#10;\" + subnote&#91;\"content\"].replace(\"\\n\", \"&amp;#13;&amp;#10;\")\n                    accessRestrict = accessRestrict.strip()\n                if note&#91;\"type\"] == \"acqinfo\":\n                    checkDACS&#91;\"acqinfo\"] = True\n                if note&#91;\"type\"] == \"bioghist\":\n                    checkDACS&#91;\"bioghist\"] = True\n                if note&#91;\"type\"] == \"scopecontent\":\n                    checkDACS&#91;\"scopecontent\"] = True\n                if note&#91;\"type\"] == \"arrangement\":\n                    checkDACS&#91;\"arrangement\"] = True\n                    \n                    \n        for agent in collection&#91;\"linked_agents\"]:\n            if agent&#91;\"role\"] == \"creator\":\n                checkDACS&#91;\"creator\"] = True\n        \n        
checkExport = all(value == True for value in checkDACS.values())\n        if checkDACS&#91;\"abstract\"] != True:\n            print (\"\\t\\tFailed to update browse pages: Collection has no abstract.\")\n            print (\"\\t\\tFailed to export collection: Collection has no abstract.\")\n        else:\n            date = \"\"\n            for dateData in collection&#91;\"dates\"]:\n                if \"expression\" in dateData.keys():\n                    date = dateData&#91;\"expression\"]\n                else:\n                    if \"end\" in dateData.keys():\n                        normalDate = dateData&#91;\"begin\"] + \"\/\" + dateData&#91;\"end\"]\n                    else:\n                        normalDate = dateData&#91;\"begin\"]\n                    date = dacs.iso2DACS(normalDate)\n            extent = \"\"\n            for extentData in collection&#91;\"extents\"]:\n                extent = extentData&#91;\"number\"] + \" \" + extentData&#91;\"extent_type\"]\n\n            ID = collection&#91;\"id_0\"].lower().strip()\n            eadID = collection&#91;\"ead_id\"].strip()\n            checkCollection = False\n            if checkCollection == False:\n                collectionData.append(&#91;ID, checkExport, normalName, date, extent, abstract, collection&#91;\"restrictions\"], accessRestrict])\n\n            for subjectRef in collection&#91;\"subjects\"]:\n                subject = client.get(subjectRef&#91;\"ref\"]).json()\n                if subject&#91;\"source\"] == \"meg\":\n                    if subject&#91;\"terms\"]&#91;0]&#91;\"term_type\"] == \"topical\":\n                        checkSubject = False\n                        for existingSubject in subjectData:\n                            if existingSubject&#91;0] == subject&#91;\"title\"]:\n                                if not ID in existingSubject:\n                                    existingSubject.append(ID)\n                                checkSubject = True\n             
           if checkSubject == False:\n                            subjectData.append(&#91;subject&#91;\"title\"], subjectRef&#91;\"ref\"], ID])    \n            if checkExport != True:\n                print (\"\\t\\tFailed to export collection: \")\n                for checkNote in checkDACS.keys():\n                    if checkDACS&#91;checkNote] == False:\n                        print (\"\\t\\t\\t\" + checkNote + \" is missing\")\n            else:\n\n                #sorting collection\n                eadDir = output_path\n                if not os.path.isdir(eadDir):\n                    os.mkdir(eadDir)            \n            \n                resourceID = collection&#91;\"uri\"].split(\"\/resources\/\")&#91;1]\n                print (\"\\t\\t\\tExporting EAD to \" + eadID+\".xml\")\n                eadResponse = client.get(\"repositories\/3\/resource_descriptions\/\" + resourceID + \".xml?numbered_cs=true&amp;include_daos=true\")\n                eadFile = os.path.join(eadDir, eadID + \".xml\")\n                f = open(eadFile, 'w', encoding='utf-8')\n                f.write(eadResponse.text)\n                f.close()\n                print (\"\\t\\t\\tSuccess!\")\n\n                #uploading to WCSU webdav\n                remote_path = \"\/\" + eadID + \".xml\"\n                webDavClient.upload_sync(remote_path=remote_path, local_path=eadFile)\n\nprint (\"\\tWriting static data back to files.\")\n#make sure the static data directory exists before writing to it\nos.makedirs(staticData, exist_ok=True)\n#write new collection data back to file\ncollectionFile = open(os.path.join(staticData, \"collections.csv\"), \"w\", newline='', encoding='utf-8')\nwriter = csv.writer(collectionFile, delimiter='|')\nwriter.writerows(collectionData)\ncollectionFile.close()\n\n#write new subjects data back to file\nsubjectFile = open(os.path.join(staticData, \"subjects.csv\"), \"w\", newline='', encoding='utf-8')\nwriter = csv.writer(subjectFile, delimiter='|')\nwriter.writerows(subjectData)\nsubjectFile.close()\n\nendTimeHuman = 
datetime.utcfromtimestamp(lastExportTime).strftime('%Y-%m-%d %H:%M:%S')\nprint (\"\\tFinished! Last Export time is \" + endTimeHuman)\ntimePath = os.path.join(__location__, \"lastExport.txt\")\nwith open(timePath, 'w') as timeFile:\n    timeFile.write(str(lastExportTime).split(\".\")&#91;0])\n    timeFile.close()\n<\/code><\/pre>\n\n\n\n<p class=\"wp-block-paragraph\"><\/p>\n\n\n<\/div>","protected":false},"excerpt":{"rendered":"<p>There are two ways we can harvest EADs from an ArchivesSpace instance: we can pull them from your server, or you can push them to our server. Option 1: We pull The CAO runs a script on our server to connect to your ArchivesSpace instance and download your EADs to our server. We will need &#8230;<\/p>\n","protected":false},"author":1,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"footnotes":""},"categories":[],"tags":[],"class_list":["post-2240","page","type-page","status-publish"],"_links":{"self":[{"href":"https:\/\/archives-library.wcsu.edu\/cao\/wp-json\/wp\/v2\/pages\/2240","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/archives-library.wcsu.edu\/cao\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/archives-library.wcsu.edu\/cao\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/archives-library.wcsu.edu\/cao\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/archives-library.wcsu.edu\/cao\/wp-json\/wp\/v2\/comments?post=2240"}],"version-history":[{"count":0,"href":"https:\/\/archives-library.wcsu.edu\/cao\/wp-json\/wp\/v2\/pages\/2240\/revisions"}],"wp:attachment":[{"href":"https:\/\/archives-library.wcsu.edu\/cao\/wp-json\/wp\/v2\/media?parent=2240"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/archives-library.wcsu.edu\/cao\/wp-json\/wp\/v2\/categories?post=2240"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/archives-library.wcsu.edu\/cao\/wp-
json\/wp\/v2\/tags?post=2240"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}